Off-policy evaluation (OPE) holds the promise of leveraging large, offline datasets for both evaluating and selecting complex policies for decision making. The ability to learn offline is particularly important in many real-world domains, such as healthcare, recommender systems, and robotics, where online data collection is an expensive and potentially dangerous process. Being able to accurately evaluate and select high-performing policies without requiring online interaction could yield significant benefits in safety, time, and cost for these applications. While many OPE methods have been proposed in recent years, comparing results across papers is difficult because there is currently no comprehensive, unified benchmark, and measuring algorithmic progress has been hampered by a shortage of difficult evaluation tasks. To address this gap, we present a collection of policies that, in conjunction with existing offline datasets, can be used for benchmarking off-policy evaluation. Our tasks include a range of challenging high-dimensional continuous control problems, with a wide selection of datasets and policies for performing policy selection. The goal of our benchmark is to provide a standardized measure of progress, motivated by a set of principles designed to challenge and test the limits of existing OPE methods. We perform an evaluation of state-of-the-art algorithms and provide open-source access to our data and code to foster future research in this area.