This paper addresses the problem of policy selection in domains with abundant logged data but a very restricted interaction budget. Solving this problem would enable safe evaluation and deployment of offline reinforcement learning policies in industry, robotics, and healthcare, among other domains. Several off-policy evaluation (OPE) techniques have been proposed to assess the value of policies using only logged data. However, a substantial gap remains between evaluation by OPE and full online evaluation in the real environment. To reduce this gap, we introduce a novel \emph{active offline policy selection} problem formulation, which combines logged data and a limited number of online interactions to identify the best policy. We rely on advances in OPE to warm start the evaluation, and we build upon Bayesian optimization to iteratively decide which policies to evaluate so that the limited environment interactions are used wisely. Because many candidate policies may be proposed, we focus on making our approach scalable and introduce a kernel function to model similarity between policies. Using several benchmark environments, we show that the proposed approach improves upon state-of-the-art OPE estimates and upon fully online policy evaluation with a limited budget. Additionally, we show that each component of the proposed method is important, and that the method works well with OPE estimates of varying number and quality, and even with a large number of candidate policies.
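The abstract only outlines the approach at a high level. As a rough illustration (not the authors' implementation), the following minimal NumPy sketch shows one way such an active selection loop could look: OPE estimates serve as the prior mean of a Gaussian process, a kernel over policy feature vectors models policy similarity, and an upper-confidence-bound rule picks which policy to evaluate online at each step of a small budget. The names `policy_features`, `evaluate_online`, and the specific UCB acquisition are illustrative assumptions.

```python
import numpy as np

def rbf_kernel(X, Y, lengthscale=1.0, variance=1.0):
    # Squared-exponential kernel on policy feature vectors (stand-in for a
    # policy-similarity kernel; the paper's actual kernel may differ).
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * d2 / lengthscale ** 2)

def gp_posterior(K, obs_idx, obs_y, noise=0.1):
    # GP posterior mean/variance over all policies, conditioned on noisy
    # online returns observed for the policies indexed by obs_idx.
    K_obs = K[np.ix_(obs_idx, obs_idx)] + noise ** 2 * np.eye(len(obs_idx))
    K_cross = K[:, obs_idx]
    mean = K_cross @ np.linalg.solve(K_obs, obs_y)
    var = np.diag(K) - np.diag(K_cross @ np.linalg.solve(K_obs, K_cross.T))
    return mean, np.clip(var, 1e-9, None)

def active_policy_selection(policy_features, ope_estimates, evaluate_online,
                            budget=20, beta=2.0, noise=0.1):
    # Warm start with OPE estimates as the prior mean, then spend the online
    # budget on the policies chosen by a UCB acquisition rule.
    K = rbf_kernel(policy_features, policy_features)
    mean, var = ope_estimates.copy(), np.diag(K).copy()
    obs_idx, obs_resid = [], []
    for _ in range(budget):
        i = int(np.argmax(mean + beta * np.sqrt(var)))
        r = evaluate_online(i)                       # one noisy online rollout
        obs_idx.append(i)
        obs_resid.append(r - ope_estimates[i])       # residual w.r.t. prior mean
        resid_mean, var = gp_posterior(K, obs_idx, np.array(obs_resid), noise)
        mean = ope_estimates + resid_mean
    return int(np.argmax(mean)), mean, var
```

In this sketch the GP is fit to residuals between online returns and OPE estimates, so the selection starts from the OPE ranking and is gradually corrected as online evidence accumulates; the kernel lets an observed return for one policy update the estimates of similar, unevaluated policies.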