Offline reinforcement learning (RL) aims to find near-optimal policies from logged data without further environment interaction. Model-based algorithms, which learn a model of the environment from the dataset and perform conservative policy optimisation within that model, have emerged as a promising approach to this problem. In this work, we present Robust Adversarial Model-Based Offline RL (RAMBO), a novel approach to model-based offline RL. To achieve conservatism, we formulate the problem as a two-player zero-sum game against an adversarial environment model. The model is trained to minimise the value function while still accurately predicting the transitions in the dataset, forcing the policy to act conservatively in areas not covered by the dataset. To approximately solve the two-player game, we alternate between optimising the policy and optimising the model adversarially. The problem formulation we address is theoretically grounded, yielding a PAC performance guarantee and a pessimistic value function that lower-bounds the value function in the true environment. We evaluate our approach on widely studied offline RL benchmarks and demonstrate that it achieves state-of-the-art performance.
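As an illustrative way to make the two-player game described above concrete, one can view the policy $\pi$ as being optimised against the worst-case learned model $\widehat{T}$ that remains consistent with the dataset $\mathcal{D}$; the notation below is a sketch of this formulation rather than a verbatim reproduction of the paper's objective:
\begin{equation*}
\pi^{*} \in \arg\max_{\pi}\; \min_{\widehat{T} \in \mathcal{M}_{\mathcal{D}}} V^{\pi}_{\widehat{T}},
\qquad
\mathcal{M}_{\mathcal{D}} = \Bigl\{ \widehat{T} \;:\; \mathbb{E}_{(s,a)\sim\mathcal{D}}\, d\bigl(\widehat{T}(\cdot \mid s,a),\, T(\cdot \mid s,a)\bigr) \le \xi \Bigr\},
\end{equation*}
where $V^{\pi}_{\widehat{T}}$ is the value of $\pi$ under the learned dynamics $\widehat{T}$, $T$ is the true dynamics, $d$ is a divergence between next-state distributions, and $\xi$ bounds how far the adversarial model may deviate from the data. Constraining the inner minimisation to $\mathcal{M}_{\mathcal{D}}$ is what restricts pessimism to regions with little dataset coverage and underpins the lower bound on the value function in the true environment.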