This paper addresses the problem of policy selection in domains with abundant logged data but a restricted interaction budget. Solving this problem would enable safe evaluation and deployment of offline reinforcement learning policies in industry, robotics, and recommendation domains, among others. Several off-policy evaluation (OPE) techniques have been proposed to assess the value of policies using only logged data. However, a substantial gap remains between OPE estimates and full online evaluation, while large amounts of online interaction are often infeasible in practice. To overcome this problem, we introduce active offline policy selection, a novel sequential decision approach that combines logged data with online interaction to identify the best policy. We use OPE estimates to warm start the online evaluation. Then, to spend the limited environment interactions wisely, we decide which policy to evaluate next with a Bayesian optimization method whose kernel captures policy similarity. Experiments on multiple benchmarks, including real-world robotics, with a large number of candidate policies show that the proposed approach improves upon state-of-the-art OPE estimates and purely online policy evaluation.
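The abstract only sketches the selection loop at a high level. Below is a minimal, hypothetical illustration of such a loop, assuming placeholder helpers `policy_features` (per-policy feature vectors so that an RBF kernel can stand in for policy similarity), `ope_estimates` (one OPE value estimate per candidate policy), and `run_episode` (a single online rollout returning an episode return). Modelling residuals over OPE with a Gaussian process and using a UCB-style acquisition are illustrative choices under these assumptions, not the paper's exact algorithm.

```python
# Minimal sketch of an active offline policy selection loop (illustrative only).
# Assumed placeholders: policy_features, ope_estimates, run_episode.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF


def active_policy_selection(policy_features, ope_estimates, run_episode,
                            budget, beta=2.0):
    """Spend `budget` online episodes to pick the best of the candidate policies."""
    X = np.asarray(policy_features, dtype=float)   # kernel inputs: policy similarity proxy
    ope = np.asarray(ope_estimates, dtype=float)   # warm-start prior from OPE
    n = len(ope)
    evaluated, returns = [], []

    for _ in range(budget):
        if evaluated:
            # Fit a GP to residuals (online return - OPE estimate), so the OPE
            # estimates act as a warm-start prior mean over policy values.
            gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), alpha=1e-2)
            gp.fit(X[evaluated], np.asarray(returns) - ope[evaluated])
            mean_res, std = gp.predict(X, return_std=True)
            mean = ope + mean_res
        else:
            # Before any online data, rely on OPE alone with uniform uncertainty.
            mean, std = ope, np.ones(n)

        # UCB acquisition: evaluate the most promising / most uncertain policy next.
        i = int(np.argmax(mean + beta * std))
        evaluated.append(i)
        returns.append(run_episode(i))

    # Final recommendation: policy with the highest posterior mean value.
    gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), alpha=1e-2)
    gp.fit(X[evaluated], np.asarray(returns) - ope[evaluated])
    return int(np.argmax(ope + gp.predict(X)))
```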