Offline reinforcement learning (RL) extends the paradigm of classical RL algorithms to learning purely from static datasets, without interacting with the underlying environment during the learning process. A key challenge in offline RL is the instability of policy training, caused by the mismatch between the distribution of the offline data and the undiscounted stationary state-action distribution of the learned policy. To avoid the detrimental impact of this distribution mismatch, we regularize the undiscounted stationary distribution of the current policy towards the offline data during policy optimization. Further, we train a dynamics model both to implement this regularization and to better estimate the stationary distribution of the current policy, reducing the error induced by distribution mismatch. On a wide range of continuous-control offline RL datasets, our method achieves competitive performance, validating our algorithm. The code is publicly available.
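To make the described idea concrete, the sketch below shows one plausible way such a method could be set up: fit a dynamics model on the offline data, roll the current policy out in that model to approximate its stationary state-action distribution, and regularize that rollout distribution towards the data with a discriminator-based penalty added to the policy objective. This is only an illustrative assumption, not the paper's exact algorithm; the discriminator penalty, network sizes, rollout horizon, and coefficient `reg_coef` are placeholders, and dynamics/critic training is omitted for brevity.

```python
# Illustrative sketch only: stationary-distribution regularization with a
# learned dynamics model. All hyper-parameters and the discriminator-based
# penalty are assumptions, not the paper's algorithm.
import torch
import torch.nn as nn
import torch.nn.functional as F

def mlp(sizes):
    layers = []
    for i in range(len(sizes) - 1):
        layers.append(nn.Linear(sizes[i], sizes[i + 1]))
        if i < len(sizes) - 2:
            layers.append(nn.ReLU())
    return nn.Sequential(*layers)

obs_dim, act_dim = 17, 6  # arbitrary sizes for the sketch

dynamics = mlp([obs_dim + act_dim, 256, 256, obs_dim])            # (s, a) -> s'
policy = nn.Sequential(mlp([obs_dim, 256, 256, act_dim]), nn.Tanh())
critic = mlp([obs_dim + act_dim, 256, 256, 1])                    # Q(s, a)
disc = mlp([obs_dim + act_dim, 256, 256, 1])                      # data vs. rollout

pi_opt = torch.optim.Adam(policy.parameters(), lr=3e-4)
disc_opt = torch.optim.Adam(disc.parameters(), lr=3e-4)

def rollout(init_states, horizon=5):
    """Short rollouts in the learned model approximate samples from the
    policy's (undiscounted) stationary state-action distribution."""
    s, states, actions = init_states, [], []
    for _ in range(horizon):
        a = policy(s)
        states.append(s)
        actions.append(a)
        s = dynamics(torch.cat([s, a], dim=-1)).detach()  # cut grads through dynamics
    return torch.cat(states), torch.cat(actions)

# One illustrative update on a random stand-in for an offline batch.
batch = {"obs": torch.randn(64, obs_dim), "act": torch.randn(64, act_dim)}
roll_s, roll_a = rollout(batch["obs"])
roll_sa = torch.cat([roll_s, roll_a], dim=-1)
data_sa = torch.cat([batch["obs"], batch["act"]], dim=-1)

# Discriminator update: low score on data pairs, high on rollout pairs
# (rollout samples detached so only the discriminator receives gradients).
disc_loss = F.softplus(disc(data_sa)).mean() + F.softplus(-disc(roll_sa.detach())).mean()
disc_opt.zero_grad()
disc_loss.backward()
disc_opt.step()

# Policy update: maximize Q on rollout samples while keeping the discriminator
# score high (i.e. "data-like"), regularizing the policy's stationary
# distribution towards the offline data. Only policy parameters are stepped.
reg_coef = 1.0
pi_loss = (-critic(roll_sa) - reg_coef * disc(roll_sa)).mean()
pi_opt.zero_grad()
pi_loss.backward()
pi_opt.step()
```

In this sketch the discriminator's score acts as a proxy density ratio between the offline data and the policy's model rollouts; any other divergence estimate between the two sample sets could play the same role in the regularizer.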