In offline reinforcement learning (RL), the goal is to learn a highly rewarding policy based solely on a dataset of historical interactions with the environment. The ability to train RL policies offline can greatly expand the applicability of RL, its data efficiency, and its experimental velocity. Prior work in offline RL has been confined almost exclusively to model-free RL approaches. In this work, we present MOReL, an algorithmic framework for model-based offline RL. This framework consists of two steps: (a) learning a pessimistic MDP (P-MDP) using the offline dataset; and (b) learning a near-optimal policy in this P-MDP. The learned P-MDP has the property that for any policy, the performance in the real environment is approximately lower-bounded by the performance in the P-MDP. This enables it to serve as a good surrogate for purposes of policy evaluation and learning, and overcome common pitfalls of model-based RL like model exploitation. Theoretically, we show that MOReL is minimax optimal (up to log factors) for offline RL. Through experiments, we show that MOReL matches or exceeds state-of-the-art results in widely studied offline RL benchmarks. Moreover, the modular design of MOReL enables future advances in its components (e.g., generative modeling, uncertainty estimation, planning) to directly translate into advances for offline RL.
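To make step (a) concrete, the sketch below shows one way a P-MDP could be assembled from learned dynamics models. It assumes an ensemble-disagreement detector for unknown state-action pairs and an absorbing `HALT` state that pays a large negative reward; the class name `PessimisticMDP`, the `threshold`, and the penalty `kappa` are illustrative choices for this sketch, not the paper's exact implementation.

```python
# A minimal P-MDP sketch: state-action pairs where the learned dynamics
# models disagree are treated as "unknown" and routed to an absorbing
# HALT state with reward -kappa; elsewhere a learned model is trusted.
import numpy as np

HALT = None  # sentinel for the absorbing, heavily penalized state


class PessimisticMDP:
    def __init__(self, models, reward_fn, threshold, kappa):
        self.models = models        # list of learned dynamics models: f(s, a) -> s'
        self.reward_fn = reward_fn  # reward function r(s, a), known or learned
        self.threshold = threshold  # disagreement level above which (s, a) is "unknown"
        self.kappa = kappa          # penalty paid for entering HALT

    def disagreement(self, s, a):
        # Maximum pairwise L2 distance between ensemble predictions.
        preds = [m(s, a) for m in self.models]
        return max(np.linalg.norm(p - q) for p in preds for q in preds)

    def step(self, s, a):
        if s is HALT:
            return HALT, -self.kappa  # HALT is absorbing
        if self.disagreement(s, a) > self.threshold:
            return HALT, -self.kappa  # unknown region: pessimistic penalty
        # Known region: roll forward with one of the learned models.
        return self.models[0](s, a), self.reward_fn(s, a)
```

Step (b) can then run any planner or policy optimizer against this `step` function; because unknown regions are penalized rather than extrapolated, a policy's return in the surrogate approximately lower-bounds its return in the real environment, which is what discourages model exploitation.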