Off-policy deep reinforcement learning algorithms commonly compensate for overestimation bias during temporal-difference learning by using pessimistic estimates of the expected target returns. In this work, we propose Generalized Pessimism Learning (GPL), a strategy employing a novel learnable penalty to enact such pessimism. In particular, we propose to learn this penalty alongside the critic with dual TD-learning, a new procedure to estimate and minimize the bias of the target returns at trivial computational cost. GPL enables us to accurately counteract overestimation bias throughout training without incurring the downsides of overly pessimistic targets. By integrating GPL with popular off-policy algorithms, we achieve state-of-the-art results on competitive proprioceptive and pixel-based benchmarks.
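To make the idea concrete, below is a minimal sketch of a TD target with a learnable pessimism penalty and a bias-driven update for that penalty. It is only illustrative: the names (`log_beta`, `pessimistic_target`, `dual_td_step`), the use of twin-critic disagreement as the uncertainty proxy, and the specific penalty update are assumptions for exposition, not the paper's exact formulation.

```python
import torch

# Learnable pessimism penalty, kept positive via exp(log_beta).
log_beta = torch.zeros(1, requires_grad=True)
beta_opt = torch.optim.Adam([log_beta], lr=3e-4)

def pessimistic_target(reward, done, q1_next, q2_next, gamma=0.99):
    """TD target penalized by the (assumed) twin-critic disagreement,
    scaled by the learned penalty beta."""
    beta = log_beta.exp()
    q_min = torch.min(q1_next, q2_next)
    uncertainty = (q1_next - q2_next).abs()  # illustrative proxy for epistemic uncertainty
    return reward + gamma * (1.0 - done) * (q_min - beta * uncertainty)

def dual_td_step(td_errors):
    """Sketch of a dual update: estimate the target-return bias as the mean TD
    error on fresh transitions and adjust beta to drive that bias toward zero."""
    bias_estimate = td_errors.mean().detach()
    # If targets overestimate (positive bias), this loss increases beta;
    # if they underestimate, beta shrinks.
    beta_loss = -log_beta.exp() * bias_estimate
    beta_opt.zero_grad()
    beta_loss.backward()
    beta_opt.step()
```

Under these assumptions, the penalty is adapted alongside the critic from the same TD errors, so the added cost over a standard twin-critic update is negligible, which matches the abstract's claim of trivial computational overhead.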