Offline reinforcement learning (RL) aims to learn an optimal policy from a batch of previously collected data, without any further interaction with the environment during training. By avoiding hazardous executions in the environment, offline RL can greatly broaden the scope of RL applications. However, current offline RL benchmarks commonly suffer from a large reality gap: they involve large datasets collected by highly exploratory policies, and the trained policy is evaluated directly in the environment. In real-world situations, by contrast, running a highly exploratory policy is prohibited to ensure system safety, the data is usually very limited, and a trained policy should be well validated before deployment. In this paper, we present a suite of near real-world benchmarks, NewRL. NewRL contains datasets from various domains with controlled sizes, as well as extra test datasets for the purpose of policy validation. We then evaluate existing offline RL algorithms on NewRL. In the experiments, we argue that the performance of a policy should also be compared with the deterministic version of the behavior policy, rather than the dataset reward, because the deterministic behavior policy is the baseline in real scenarios, while the dataset is often collected with action perturbations that can degrade performance. The empirical results demonstrate that the tested offline RL algorithms appear merely competitive with this deterministic policy on many datasets, and that offline policy evaluation hardly helps. The NewRL suite can be found at http://polixir.ai/research/newrl. We hope this work will shed some light on future research and draw more attention to the challenges of deploying RL in real-world systems.