We study the problem of conservative off-policy evaluation (COPE) where, given an offline dataset of environment interactions collected by other agents, we seek to obtain a (tight) lower bound on a policy's performance. This is crucial when deciding whether a given policy satisfies certain minimal performance/safety criteria before it can be deployed in the real world. To this end, we introduce HAMBO, which builds on an uncertainty-aware learned model of the transition dynamics. To form a conservative estimate of the policy's performance, HAMBO hallucinates worst-case trajectories that the policy may take, within the margin of the model's epistemic confidence regions. We prove that the resulting COPE estimates are valid lower bounds and, under regularity conditions, show their convergence to the true expected return. Finally, we discuss scalable variants of our approach based on Bayesian Neural Networks and empirically demonstrate that they yield reliable and tight lower bounds in various continuous control environments.
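To make the core idea concrete, the sketch below illustrates one way a pessimistic ("hallucinated worst-case") rollout could be implemented, assuming a learned dynamics model that exposes a per-dimension epistemic mean and standard deviation. All names here (`model.predict`, `policy`, `reward_fn`, the confidence scaling `beta`) are illustrative placeholders under those assumptions, not the paper's actual implementation.

```python
import numpy as np

def pessimistic_return(model, policy, reward_fn, s0, horizon,
                       gamma=0.99, beta=2.0, n_candidates=32, rng=None):
    """Conservative (lower-bound style) return estimate from state s0.

    At each step, an adversary picks, among candidate next states sampled
    inside the model's epistemic confidence region mu +/- beta * sigma,
    the one minimizing the immediate reward -- a crude, greedy stand-in
    for the worst-case trajectory search described in the abstract.
    """
    rng = np.random.default_rng(rng)
    s = np.asarray(s0, dtype=float)
    total, discount = 0.0, 1.0
    for _ in range(horizon):
        a = policy(s)
        mu, sigma = model.predict(s, a)  # hypothetical: epistemic mean / std
        # Sample candidate next states inside the confidence region.
        eta = rng.uniform(-1.0, 1.0, size=(n_candidates, mu.shape[0]))
        candidates = mu + beta * sigma * eta
        # Adversary: hallucinate the candidate with the lowest reward.
        rewards = np.array([reward_fn(s, a, c) for c in candidates])
        worst = int(np.argmin(rewards))
        total += discount * rewards[worst]
        discount *= gamma
        s = candidates[worst]
    return total
```

A greedy one-step adversary is a simplification; the method as stated optimizes over whole worst-case trajectories within the confidence regions, which a multi-step search or learned adversary would approximate more faithfully.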