In offline reinforcement learning (offline RL), one of the main challenges is to deal with the distributional shift between the learning policy and the given dataset. To address this problem, recent offline RL methods attempt to introduce conservatism bias to encourage learning in high-confidence areas. Model-free approaches directly encode such bias into policy or value function learning using conservative regularizations or special network structures, but their constrained policy search limits generalization beyond the offline dataset. Model-based approaches learn forward dynamics models with conservatism quantifications and then generate imaginary trajectories to extend the offline datasets. However, due to limited samples in offline datasets, conservatism quantifications often suffer from overgeneralization in out-of-support regions. These unreliable conservatism measures can mislead forward model-based imagination into undesired regions, resulting in overaggressive behaviors. To encourage more conservatism, we propose a novel model-based offline RL framework, called Reverse Offline Model-based Imagination (ROMI). We learn a reverse dynamics model in conjunction with a novel reverse policy, which can generate rollouts leading to the target goal states within the offline dataset. These reverse imaginations provide informed data augmentation for model-free policy learning and enable conservative generalization beyond the offline dataset. ROMI can effectively combine with off-the-shelf model-free algorithms to enable model-based generalization with proper conservatism. Empirical results show that our method can generate more conservative behaviors and achieve state-of-the-art performance on offline RL benchmark tasks.
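To make the backward-rollout idea concrete, below is a minimal sketch in PyTorch. It uses deterministic MLPs for brevity, whereas the actual method would model stochastic dynamics; all names here (ReverseDynamics, ReversePolicy, reverse_rollout) are illustrative assumptions, not ROMI's released code.

```python
import torch
import torch.nn as nn

class ReverseDynamics(nn.Module):
    """Predicts the previous state s_{t-1} given (s_t, a_{t-1})."""
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim),
        )

    def forward(self, next_state, action):
        return self.net(torch.cat([next_state, action], dim=-1))

class ReversePolicy(nn.Module):
    """Proposes an action a_{t-1} that plausibly led into s_t."""
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim), nn.Tanh(),
        )

    def forward(self, next_state):
        return self.net(next_state)

@torch.no_grad()
def reverse_rollout(start_states, reverse_policy, reverse_model, horizon=5):
    """Roll backward from states sampled out of the offline dataset.

    Each imagined (s_{t-1}, a_{t-1}, s_t) transition terminates inside the
    dataset's support by construction, which is the source of the method's
    conservatism.
    """
    transitions = []
    s_t = start_states  # shape: (batch, state_dim), drawn from the dataset
    for _ in range(horizon):
        a_prev = reverse_policy(s_t)
        s_prev = reverse_model(s_t, a_prev)
        transitions.append((s_prev, a_prev, s_t))
        s_t = s_prev  # continue rolling backward from the imagined state
    return transitions
```

In use, the imagined transitions (with rewards supplied by a learned reward model) would be merged with the offline dataset and fed to an off-the-shelf model-free learner, e.g., a BCQ- or CQL-style algorithm, which then trains on the augmented data as usual.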