In recent years, vast progress has been made in Offline Reinforcement Learning (Offline-RL) across decision-making domains ranging from finance to robotics. However, the way new Offline-RL algorithms are compared and reported remains underdeveloped: (1) an unlimited online evaluation budget is used for hyperparameter search, (2) offline policy selection is sidestepped, and (3) performance statistics are reported in an ad-hoc manner. In this work, we propose an evaluation technique that addresses these issues, Expected Online Performance, which provides a performance estimate for the best-found policy under a fixed online evaluation budget. Using our approach, we can estimate the number of online evaluations required to surpass a given behavioral policy's performance. Applying it to several Offline-RL baselines, we find that with a limited online evaluation budget, (1) Behavioral Cloning constitutes a strong baseline over various expert levels and data regimes, and (2) offline uniform policy selection is competitive with value-based approaches. We hope the proposed technique will make it into the toolsets of Offline-RL practitioners, helping them arrive at informed conclusions when deploying RL in real-world systems.
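The abstract describes Expected Online Performance only at a high level. As a minimal sketch (not the paper's implementation), assuming the estimate amounts to the expected best return among k policies drawn uniformly from the set of trained policies, a Monte-Carlo version can be written in a few lines; the function name `expected_online_performance` and the synthetic `policy_scores` below are illustrative assumptions.

```python
import numpy as np

def expected_online_performance(scores, budget, n_samples=10_000, seed=0):
    """Monte-Carlo estimate of the expected best return among `budget`
    policies drawn uniformly without replacement from `scores`.

    scores : 1-D array of online returns, one per trained policy/config.
    budget : number of policies we can afford to evaluate online (k).
    """
    scores = np.asarray(scores, dtype=float)
    rng = np.random.default_rng(seed)
    k = min(budget, len(scores))
    # Repeatedly sample k policies, keep the best return, and average.
    best = [
        rng.choice(scores, size=k, replace=False).max()
        for _ in range(n_samples)
    ]
    return float(np.mean(best))

# Hypothetical returns for 20 hyperparameter configurations.
policy_scores = np.random.default_rng(1).normal(loc=50.0, scale=10.0, size=20)
for k in (1, 3, 10):
    print(f"budget={k}: EOP ≈ {expected_online_performance(policy_scores, k):.1f}")
```

Sweeping the budget k and comparing the resulting curve against the behavioral policy's return is one way to read off how many online evaluations are needed to surpass it.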