In offline reinforcement learning (offline RL), one of the main challenges is to deal with the distributional shift between the learning policy and the given dataset. To address this problem, recent offline RL methods attempt to introduce conservatism bias to encourage learning in high-confidence areas. Model-free approaches directly encode such bias into policy or value function learning using conservative regularizations or special network structures, but their constrained policy search limits the generalization beyond the offline dataset. Model-based approaches learn forward dynamics models with conservatism quantifications and then generate imaginary trajectories to extend the offline datasets. However, due to the limited samples in the offline dataset, conservatism quantifications often suffer from overgeneralization in out-of-support regions. The unreliable conservative measures will mislead forward model-based imaginations to undesired areas, leading to overaggressive behaviors. To encourage more conservatism, we propose a novel model-based offline RL framework, called Reverse Offline Model-based Imagination (ROMI). We learn a reverse dynamics model in conjunction with a novel reverse policy, which can generate rollouts leading to the target goal states within the offline dataset. These reverse imaginations provide informed data augmentation for the model-free policy learning and enable conservative generalization beyond the offline dataset. ROMI can effectively combine with off-the-shelf model-free algorithms to enable model-based generalization with proper conservatism. Empirical results show that our method can generate more conservative behaviors and achieve state-of-the-art performance on offline RL benchmark tasks.
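The core mechanism described above — rolling a learned reverse dynamics model and a reverse policy backward from states inside the offline dataset, then handing the imagined transitions to a model-free learner as augmented data — can be sketched as below. This is a minimal illustration under assumed interfaces, not the authors' implementation: `reverse_rollouts`, the callables `reverse_policy(s)` and `reverse_model(s, a)`, and all hyperparameters (rollout horizon, rollout count, the toy stand-in models) are hypothetical placeholders.

```python
import numpy as np

def reverse_rollouts(dataset_states, reverse_policy, reverse_model,
                     horizon=5, n_rollouts=100, rng=None):
    """Roll backward from real dataset states to build augmented transitions.

    reverse_policy(s)   -> action a imagined to have preceded state s
    reverse_model(s, a) -> (s_prev, r): predicted predecessor state and reward
    The returned (s_prev, a, r, s) tuples can be appended to the offline
    buffer used by any off-the-shelf model-free offline RL algorithm.
    """
    rng = rng or np.random.default_rng(0)
    augmented = []
    for _ in range(n_rollouts):
        # Anchor every imagined trajectory at a state observed in the dataset,
        # so the rollout terminates inside the data support by construction.
        s = dataset_states[rng.integers(len(dataset_states))]
        for _ in range(horizon):
            a = reverse_policy(s)
            s_prev, r = reverse_model(s, a)
            augmented.append((s_prev, a, r, s))
            s = s_prev  # continue backward from the imagined predecessor
    return augmented

# Toy usage with stand-in linear models (purely illustrative).
if __name__ == "__main__":
    states = np.random.randn(1000, 4)
    policy = lambda s: np.tanh(s[:2])                                   # fake reverse policy
    model = lambda s, a: (s - 0.1 * np.pad(a, (0, 2)),                  # fake predecessor state
                          float(-np.abs(a).sum()))                      # fake reward
    batch = reverse_rollouts(states, policy, model)
    print(len(batch), "imagined transitions")
```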