The world currently offers an abundance of data in multiple domains, from which we can learn reinforcement learning (RL) policies without further interaction with the environment. RL agents can learn offline from such data, but deploying them while learning might be dangerous in domains where safety is critical. Therefore, it is essential to estimate how a newly learned agent will perform if deployed in the target environment before actually deploying it, and without the risk of overestimating its true performance. To achieve this, we introduce a framework for safe evaluation of offline learning using approximate high-confidence off-policy evaluation (HCOPE) to estimate the performance of offline policies during learning. In our setting, we assume a source of data, which we split into a train-set, used to learn an offline policy, and a test-set, used to estimate a lower bound on the offline policy's performance via off-policy evaluation with bootstrapping. The lower-bound estimate tells us how well a newly learned target policy would perform before it is deployed in the real environment, and therefore allows us to decide when to deploy our learned policy.
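As a rough illustration of the evaluation step described above, the sketch below computes an approximate lower bound on a target policy's performance from held-out test trajectories, using per-trajectory importance-sampled returns and a percentile bootstrap. This is a minimal sketch under assumed data structures, not the authors' implementation; the function names (`trajectory_is_return`, `bootstrap_lower_bound`), the trajectory format, and the choice of a simple percentile bootstrap are illustrative assumptions.

```python
# Minimal sketch (not the paper's code) of a bootstrapped lower bound on
# off-policy performance. Assumes each trajectory is a list of
# (state, action, reward, behavior_prob) tuples and that
# target_policy(state, action) returns the target policy's probability.
import numpy as np

def trajectory_is_return(traj, target_policy, gamma=0.99):
    """Importance-sampled discounted return for one trajectory."""
    weight, ret, discount = 1.0, 0.0, 1.0
    for state, action, reward, behavior_prob in traj:
        weight *= target_policy(state, action) / behavior_prob
        ret += discount * reward
        discount *= gamma
    return weight * ret

def bootstrap_lower_bound(test_trajs, target_policy, delta=0.05,
                          n_boot=2000, seed=0):
    """Approximate (1 - delta)-confidence lower bound via percentile bootstrap."""
    rng = np.random.default_rng(seed)
    returns = np.array(
        [trajectory_is_return(t, target_policy) for t in test_trajs]
    )
    boot_means = np.array([
        rng.choice(returns, size=len(returns), replace=True).mean()
        for _ in range(n_boot)
    ])
    return np.percentile(boot_means, 100 * delta)

# Usage (illustrative): deploy the newly learned policy only if the
# estimated lower bound clears a chosen safety threshold.
# if bootstrap_lower_bound(test_set, pi_target) >= performance_threshold:
#     deploy(pi_target)
```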