This paper addresses the problem of policy selection in domains with abundant logged data but a very restricted interaction budget. Solving this problem would enable safe evaluation and deployment of offline reinforcement learning policies in industry, robotics, and recommendation domains, among others. Several off-policy evaluation (OPE) techniques have been proposed to assess the value of policies using only logged data. However, a large gap remains between OPE estimates and full online evaluation in the real environment. At the same time, a large number of online interactions is often infeasible in practice. To overcome this problem, we introduce \emph{active offline policy selection} -- a novel sequential decision approach that combines logged data with online interaction to identify the best policy. This approach uses OPE estimates to warm-start the online evaluation. Then, in order to spend the limited environment interactions wisely, it relies on a Bayesian optimization method, with a kernel function that represents policy similarity, to decide which policy to evaluate next. We use multiple benchmarks with a large number of candidate policies to show that the proposed approach improves upon state-of-the-art OPE estimates and pure online policy evaluation.
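To make the described loop concrete, the following is a minimal sketch of one possible active offline policy selection procedure: a Gaussian process over candidate policies whose prior mean is warm-started with OPE estimates, whose kernel is built from a precomputed policy-similarity (distance) matrix, and whose next policy to evaluate online is chosen by a UCB-style acquisition rule. All names (`ope_estimates`, `policy_distances`, `run_episode`) and the specific acquisition rule are illustrative assumptions, not the paper's actual implementation.

```python
# Sketch of active offline policy selection, assuming:
#  - ope_estimates[i]: OPE value estimate for candidate policy i (used as GP prior mean),
#  - policy_distances[i, j]: precomputed distance between policies i and j
#    (e.g., based on their action distributions on logged states),
#  - run_episode(i): executes policy i online once and returns a noisy episodic return.
import numpy as np

def active_policy_selection(ope_estimates, policy_distances, run_episode,
                            budget=100, length_scale=1.0, noise_var=1.0, beta=2.0):
    prior_mean = np.asarray(ope_estimates, dtype=float)
    # Kernel over policies: similarity decays with policy distance.
    K = np.exp(-(np.asarray(policy_distances) / length_scale) ** 2)

    mean, var = prior_mean.copy(), np.diag(K).copy()
    obs_idx, obs_val = [], []  # evaluated policy indices and observed returns

    for _ in range(budget):
        # UCB-style acquisition: evaluate the policy with the highest optimistic value.
        i = int(np.argmax(mean + beta * np.sqrt(var)))
        obs_idx.append(i)
        obs_val.append(run_episode(i))

        # GP posterior over all candidates given the online observations so far.
        K_oo = K[np.ix_(obs_idx, obs_idx)] + noise_var * np.eye(len(obs_idx))
        K_ao = K[:, obs_idx]
        resid = np.array(obs_val) - prior_mean[obs_idx]
        mean = prior_mean + K_ao @ np.linalg.solve(K_oo, resid)
        var = np.clip(np.diag(K) - np.diag(K_ao @ np.linalg.solve(K_oo, K_ao.T)),
                      1e-12, None)

    # Recommend the policy with the highest posterior mean once the budget is spent.
    return int(np.argmax(mean))
```

The design choice illustrated here is that OPE only sets the prior: as online returns arrive, the posterior mean moves toward the observed values, and the kernel propagates this evidence to similar policies that were never evaluated online.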