Offline reinforcement learning faces the significant challenge of value over-estimation caused by the distributional shift between the dataset and the currently learned policy, which often leads to learning failure in practice. A common approach is to incorporate a penalty term into the reward or value estimate in the Bellman iterations. To avoid extrapolation over out-of-distribution (OOD) states and actions, existing methods focus on conservative Q-function estimation. In this paper, we propose Conservative State Value Estimation (CSVE), a new approach that learns a conservative V-function by directly imposing a penalty on OOD states. Compared to prior work, CSVE allows more effective in-data policy optimization with conservative value guarantees. Further, we apply CSVE to develop a practical actor-critic algorithm in which the critic performs conservative value estimation by additionally sampling and penalizing the states \emph{around} the dataset, and the actor improves the policy with advantage-weighted updates extended with state exploration. We evaluate our method on the classic continuous control tasks of D4RL, showing that it outperforms conservative Q-function learning methods and is strongly competitive with recent SOTA methods.
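As a rough illustration of the idea described above, the conservative state value objective can be sketched by analogy with conservative Q-learning: penalize the value of states drawn from a distribution $d(\cdot)$ concentrated around the dataset $\mathcal{D}$ while fitting the Bellman target on in-data states. The specific form, weighting, and sampling distribution below are illustrative assumptions rather than the paper's exact objective:
\begin{equation*}
\min_{V}\; \alpha\Big(\mathbb{E}_{s\sim d}\big[V(s)\big] - \mathbb{E}_{s\sim\mathcal{D}}\big[V(s)\big]\Big) \;+\; \mathbb{E}_{s\sim\mathcal{D}}\Big[\big(V(s) - \hat{\mathcal{B}}^{\pi}V(s)\big)^{2}\Big],
\end{equation*}
where $\hat{\mathcal{B}}^{\pi}$ denotes the empirical Bellman operator under the current policy $\pi$ and $\alpha>0$ trades off conservatism against fitting accuracy.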