Offline reinforcement learning has attracted much interest as a way to address the application challenges of traditional reinforcement learning, since it trains agents from previously collected datasets without any environment interaction. To address the overestimation of out-of-distribution (OOD) actions, conservative estimation assigns low values to all inputs. Previous conservative estimation methods usually have difficulty avoiding the impact of OOD actions on Q-value estimates, and they typically sacrifice some computational efficiency to achieve conservative estimation. In this paper, we propose a simple conservative estimation method, double conservative estimates (DCE), which uses two conservative estimation mechanisms to constrain the policy. Our algorithm introduces a V-function to avoid errors on in-distribution actions while implicitly achieving conservative estimation. In addition, it uses a controllable penalty term to change the degree of conservatism during training. We theoretically show how this method influences the estimation of OOD and in-distribution actions. Our experiments show how each of the two conservative estimation mechanisms affects the estimates of all state-action pairs. DCE achieves state-of-the-art performance on D4RL.
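The abstract describes the two conservative mechanisms only at a high level, so the sketch below is a hypothetical PyTorch illustration rather than the authors' exact DCE update: it pairs an expectile-fitted V-function (so bootstrapped Q-targets never evaluate OOD actions) with a tunable penalty coefficient on policy-sampled actions that controls the degree of conservatism. All names here (`expectile_loss`, `critic_losses`, `alpha`, the network interfaces) are assumptions made for illustration.

```python
# Hypothetical sketch of a doubly-conservative critic update (not the exact DCE algorithm).
# Assumes q_net(s, a), v_net(s), and policy(s) are callables returning tensors of matching shapes.
import torch
import torch.nn.functional as F

def expectile_loss(diff, tau=0.7):
    """Asymmetric (expectile) regression loss that fits V toward in-support Q-values."""
    weight = torch.abs(tau - (diff < 0).float())
    return (weight * diff.pow(2)).mean()

def critic_losses(q_net, v_net, policy, batch, gamma=0.99, alpha=1.0):
    s, a, r, s_next, done = batch  # tensors sampled from the offline dataset

    # 1) Fit V(s) toward Q(s, a) on dataset actions only, so V never queries OOD actions.
    with torch.no_grad():
        q_data = q_net(s, a)
    v_loss = expectile_loss(q_data - v_net(s))

    # 2) Bootstrap Q from V(s'); the target therefore avoids evaluating OOD actions.
    with torch.no_grad():
        target = r + gamma * (1.0 - done) * v_net(s_next)
    q_loss = F.mse_loss(q_net(s, a), target)

    # 3) Controllable conservatism: penalize Q on actions sampled from the current policy.
    pi_a = policy(s)
    penalty = q_net(s, pi_a).mean()
    q_loss = q_loss + alpha * penalty  # larger alpha -> more conservative Q-estimates

    return q_loss, v_loss
```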