Accurate value estimates are important for off-policy reinforcement learning. Algorithms based on temporal difference learning are typically prone to an over- or underestimation bias that builds up over time. In this paper, we propose a general method called Adaptively Calibrated Critics (ACC) that uses the most recent high-variance but unbiased on-policy rollouts to alleviate the bias of the low-variance temporal difference targets. We apply ACC to Truncated Quantile Critics, an algorithm for continuous control that allows the bias to be regulated with a hyperparameter tuned per environment. The resulting algorithm adaptively adjusts this parameter during training, rendering hyperparameter search unnecessary, and sets a new state of the art on the OpenAI gym continuous control benchmark among all algorithms that do not tune hyperparameters per environment. Additionally, we demonstrate that ACC is quite general by further applying it to TD3 and showing improved performance in this setting as well.
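To make the calibration idea concrete, the following is a minimal sketch in Python, not taken from the paper: it compares unbiased on-policy returns against the critic's estimates and nudges a bias-controlling parameter (analogous to the number of dropped quantiles in Truncated Quantile Critics) in the direction that reduces the mismatch. The function name `acc_update`, the parameter `beta`, and the step-size rule are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def acc_update(beta, on_policy_returns, critic_estimates,
               step_size=0.1, beta_min=0.0, beta_max=1.0):
    """Sketch of an ACC-style adjustment of a bias-controlling parameter.

    beta              -- scalar controlling how conservative the TD targets are
                         (assumption: larger beta => more pessimistic targets)
    on_policy_returns -- Monte Carlo returns from recent on-policy rollouts
                         (unbiased but high variance)
    critic_estimates  -- critic values for the same states/actions
                         (low variance but possibly biased)
    """
    # Positive error => critic overestimates => make targets more pessimistic.
    bias_estimate = np.mean(np.asarray(critic_estimates) -
                            np.asarray(on_policy_returns))
    beta = beta + step_size * bias_estimate
    return float(np.clip(beta, beta_min, beta_max))

# Hypothetical usage: called periodically during training on fresh rollouts.
beta = 0.5
beta = acc_update(beta,
                  on_policy_returns=[10.2, 9.8, 11.0],
                  critic_estimates=[11.5, 10.9, 12.1])
```

Because the on-policy returns are unbiased, their average disagreement with the critic serves as a noisy but consistent signal of the critic's bias, which is what allows the parameter to be adapted online instead of being tuned per environment.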