Offline reinforcement learning (RL) refers to the problem of learning policies entirely from a large batch of previously collected data. This problem setting offers the promise of utilizing such datasets to acquire policies without any costly or dangerous active exploration. However, it is also challenging, due to the distributional shift between the offline training data and the states visited by the learned policy. Despite significant recent progress, the most successful prior methods are model-free and constrain the policy to the support of the data, precluding generalization to unseen states. In this paper, we first observe that an existing model-based RL algorithm already produces significant gains in the offline setting compared to model-free approaches. However, standard model-based RL methods, designed for the online setting, do not provide an explicit mechanism to avoid the offline setting's distributional shift issue. We therefore propose to modify existing model-based RL methods by applying them to rewards artificially penalized by the uncertainty of the learned dynamics. We theoretically show that the algorithm maximizes a lower bound of the policy's return under the true MDP. We also characterize the trade-off between the gain and risk of leaving the support of the batch data. Our algorithm, Model-based Offline Policy Optimization (MOPO), outperforms standard model-based RL algorithms and prior state-of-the-art model-free offline RL algorithms on existing offline RL benchmarks and on two challenging continuous control tasks that require generalizing from data collected for a different task. The code is available at https://github.com/tianheyu927/mopo.
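To make the reward-penalty idea concrete, below is a minimal sketch (not the repository's implementation): assuming an ensemble of probabilistic dynamics models, an uncertainty estimate u(s, a) is formed from the ensemble's predicted standard deviations and subtracted, scaled by a penalty coefficient lambda, from the model's predicted reward. The function name `penalized_reward` and the max-norm uncertainty heuristic are illustrative assumptions, not details stated in the abstract.

```python
import numpy as np

def penalized_reward(reward, ensemble_stds, penalty_coef=1.0):
    """Uncertainty-penalized reward: r_tilde(s, a) = r_hat(s, a) - lambda * u(s, a).

    reward        : scalar reward predicted by the learned model for (s, a).
    ensemble_stds : iterable of predicted next-state std vectors, one 1-D
                    array per dynamics model in the ensemble.
    penalty_coef  : the penalty coefficient lambda.
    """
    # Illustrative uncertainty heuristic: the largest norm of the predicted
    # standard deviation across the ensemble of dynamics models.
    uncertainty = max(np.linalg.norm(std) for std in ensemble_stds)
    return reward - penalty_coef * uncertainty

# Example: two ensemble members disagree mildly about the next state,
# so the reward of 1.0 is reduced by lambda times the largest predicted std norm.
stds = [np.array([0.1, 0.2]), np.array([0.3, 0.1])]
print(penalized_reward(1.0, stds, penalty_coef=0.5))  # ~0.84
```

Policy optimization would then proceed with any standard model-based RL algorithm on model rollouts whose rewards are replaced by this penalized value.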