Model-based methods provide an effective approach to offline reinforcement learning (RL). They learn an environment dynamics model from interaction experiences and then perform policy optimization based on the learned model. However, previous model-based offline RL methods lack long-term prediction capability, resulting in large errors when generating multi-step trajectories. We address this issue by developing a sequence modeling architecture, Environment Transformer, which can generate reliable long-horizon trajectories from offline datasets. We then propose a novel model-based offline RL algorithm, ENTROPY, which learns the dynamics model and reward function with the ENvironment TRansformer and performs Offline PolicY optimization. We evaluate the proposed method on MuJoCo continuous control RL environments. Results show that ENTROPY performs comparably to or better than state-of-the-art model-based and model-free offline RL methods, and demonstrates stronger long-term trajectory prediction capability than existing model-based offline methods.
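To make the described architecture concrete, below is a minimal sketch (not the authors' released implementation) of a Transformer-based dynamics and reward model in the spirit of the Environment Transformer: it consumes an interleaved history of state and action tokens from an offline dataset and autoregressively predicts the next state and the reward. The class name `EnvTransformerSketch`, the embedding sizes, layer counts, and the interleaving scheme are illustrative assumptions.

```python
# Minimal sketch of a Transformer dynamics + reward model for offline RL.
# Assumptions: interleaved (s, a) token sequence, causal attention,
# prediction heads on the action-token positions. Not the paper's exact code.
import torch
import torch.nn as nn


class EnvTransformerSketch(nn.Module):
    def __init__(self, state_dim, action_dim, embed_dim=128,
                 n_layers=3, n_heads=4, max_len=64):
        super().__init__()
        self.state_embed = nn.Linear(state_dim, embed_dim)
        self.action_embed = nn.Linear(action_dim, embed_dim)
        self.pos_embed = nn.Embedding(2 * max_len, embed_dim)
        layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=n_heads, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.next_state_head = nn.Linear(embed_dim, state_dim)
        self.reward_head = nn.Linear(embed_dim, 1)

    def forward(self, states, actions):
        # states: (B, T, state_dim), actions: (B, T, action_dim)
        B, T, _ = states.shape
        # Interleave tokens as s_1, a_1, s_2, a_2, ..., s_T, a_T.
        tokens = torch.stack(
            (self.state_embed(states), self.action_embed(actions)), dim=2
        ).reshape(B, 2 * T, -1)
        tokens = tokens + self.pos_embed(
            torch.arange(2 * T, device=states.device)
        )
        # Causal mask so each token attends only to earlier tokens.
        mask = torch.triu(
            torch.full((2 * T, 2 * T), float("-inf"), device=states.device),
            diagonal=1,
        )
        h = self.encoder(tokens, mask=mask)
        # Predict s_{t+1} and r_t from the hidden state at each action token.
        h_action = h[:, 1::2]
        return self.next_state_head(h_action), self.reward_head(h_action).squeeze(-1)


# Usage: condition on an offline (state, action) history and predict
# next states and rewards, which can then be rolled out for multi-step
# trajectory generation during policy optimization.
model = EnvTransformerSketch(state_dim=17, action_dim=6)
s = torch.randn(8, 10, 17)   # batch of 8 histories, horizon 10
a = torch.randn(8, 10, 6)
next_states, rewards = model(s, a)  # shapes (8, 10, 17) and (8, 10)
```

Conditioning each prediction on the full token history, rather than only the most recent transition, is what a sequence model of this kind relies on to keep multi-step rollouts consistent over long horizons.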