This paper addresses the problem of policy selection in domains with abundant logged data but a restricted interaction budget. Solving this problem would enable safe evaluation and deployment of offline reinforcement learning policies in industry, robotics, and recommendation domains, among others. Several off-policy evaluation (OPE) techniques have been proposed to assess the value of policies using only logged data. However, a substantial gap remains between OPE estimates and full online evaluation in the real environment, and large amounts of online interaction are often infeasible in practice. To overcome this problem, we introduce \emph{active offline policy selection}: a novel sequential decision approach that combines logged data with online interaction to identify the best policy. This approach uses OPE estimates to warm-start the online evaluation. Then, to use the limited environment interactions wisely, we decide which policy to evaluate next with a Bayesian optimization method whose kernel function represents policy similarity. On multiple benchmarks with large numbers of candidate policies, we show that the proposed approach improves upon state-of-the-art OPE estimates and purely online policy evaluation.
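To make the selection loop concrete, the sketch below shows a generic Gaussian-process bandit over candidate policies: OPE estimates serve as the prior mean, a precomputed policy-similarity kernel defines the covariance, and an upper-confidence-bound rule picks which policy to roll out next. The function names (\texttt{gp\_posterior}, \texttt{active\_policy\_selection}), the UCB acquisition, and the default hyperparameters are illustrative assumptions for exposition; they are not the exact algorithm, kernel, or observation model developed in the paper.

\begin{verbatim}
import numpy as np


def gp_posterior(prior_mean, K, obs_idx, obs_ret, noise):
    """GP posterior mean/variance of all policy values given noisy online returns.

    NOTE: illustrative sketch; the kernel K and noise model are assumptions.
    """
    if not obs_idx:
        return prior_mean.copy(), np.diag(K).copy()
    K_oo = K[np.ix_(obs_idx, obs_idx)] + noise * np.eye(len(obs_idx))
    K_ao = K[:, obs_idx]
    alpha = np.linalg.solve(K_oo, np.asarray(obs_ret) - prior_mean[obs_idx])
    mean = prior_mean + K_ao @ alpha
    var = np.diag(K) - np.sum(K_ao * np.linalg.solve(K_oo, K_ao.T).T, axis=1)
    return mean, np.maximum(var, 0.0)


def active_policy_selection(ope_estimates, K, rollout, budget, noise=1.0, beta=2.0):
    """Spend `budget` online episodes sequentially and recommend the best policy.

    ope_estimates : (n,) OPE value estimates, used as the GP prior mean (warm start).
    K             : (n, n) PSD kernel matrix encoding policy similarity.
    rollout       : callable i -> noisy episodic return of policy i in the environment.
    """
    ope_estimates = np.asarray(ope_estimates, dtype=float)
    obs_idx, obs_ret = [], []
    for _ in range(budget):
        mean, var = gp_posterior(ope_estimates, K, obs_idx, obs_ret, noise)
        nxt = int(np.argmax(mean + beta * np.sqrt(var)))  # UCB acquisition
        obs_idx.append(nxt)
        obs_ret.append(rollout(nxt))  # one online episode with policy `nxt`
    mean, _ = gp_posterior(ope_estimates, K, obs_idx, obs_ret, noise)
    return int(np.argmax(mean))  # recommend the highest posterior-mean policy
\end{verbatim}

With zero budget this reduces to ranking policies by their OPE estimates alone; each additional online episode sharpens the posterior for the evaluated policy and, through the kernel, for similar policies as well.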