Offline reinforcement learning (RL) shows promise for applying RL to real-world problems by effectively utilizing previously collected data. Most existing offline RL algorithms use regularization or constraints to suppress extrapolation error for actions outside the dataset. In this paper, we adopt a different framework, which learns the V-function instead of the Q-function to naturally keep the learning procedure within the support of the offline dataset. To enable effective generalization while maintaining proper conservatism in offline learning, we propose Expectile V-Learning (EVL), which smoothly interpolates between optimal value learning and behavior cloning. Further, we introduce implicit planning along offline trajectories to enhance learned V-values and accelerate convergence. Together, we present a new offline method called Value-based Episodic Memory (VEM). We provide theoretical analysis of the convergence properties of our proposed VEM method, and empirical results on the D4RL benchmark show that our method achieves superior performance in most tasks, particularly in sparse-reward tasks.
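To make the interpolation concrete, the sketch below writes a standard expectile-regression objective on TD targets; it illustrates the general mechanism rather than the exact operator defined in the paper, and the notation ($\mathcal{D}$ for the dataset, $\delta$ for the TD error, $\tau$ for the expectile level) is assumed for illustration:
$$
\mathcal{L}_\tau(v; s) \;=\; \mathbb{E}_{(a, r, s') \sim \mathcal{D}}\!\left[\, \big|\tau - \mathbb{1}\{\delta < 0\}\big|\, \delta^2 \right],
\qquad \delta = r + \gamma V(s') - v .
$$
With $\tau = 0.5$ this reduces to a symmetric squared TD error, i.e., evaluating the behavior policy (behavior cloning in value space), while $\tau \to 1$ increasingly weights positive errors and pushes $V$ toward the best returns supported by the dataset, approximating optimal value learning without querying out-of-support actions.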