Accurate value estimates are important for off-policy reinforcement learning. Algorithms based on temporal difference learning are typically prone to an over- or underestimation bias that builds up over time. In this paper, we propose a general method called Adaptively Calibrated Critics (ACC) that uses the most recent high-variance but unbiased on-policy rollouts to alleviate the bias of the low-variance temporal difference targets. We apply ACC to Truncated Quantile Critics, an algorithm for continuous control that allows regulation of the bias with a hyperparameter tuned per environment. The resulting algorithm adaptively adjusts this parameter during training, rendering hyperparameter search unnecessary, and sets a new state of the art on the OpenAI gym continuous control benchmark among all algorithms that do not tune hyperparameters per environment. ACC further achieves improved results on several tasks from the Meta-World robot benchmark. Additionally, we demonstrate the generality of ACC by applying it to TD3, where it also yields improved performance.
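The following is a minimal sketch of the calibration idea described in the abstract, not the authors' implementation: a scalar bias-control parameter (here called `beta`, standing in for, e.g., the number of dropped quantiles in TQC) is nudged using the gap between unbiased on-policy rollout returns and the critic's estimates. The function name, update rule, and step size are illustrative assumptions.

```python
import numpy as np

def acc_update(beta, rollout_returns, q_estimates, step_size=0.1,
               beta_min=0.0, beta_max=1.0):
    """Sketch of adaptively calibrating a bias-control parameter.

    rollout_returns -- unbiased (high-variance) returns from recent
                       on-policy rollouts
    q_estimates     -- the critic's (low-variance, possibly biased)
                       estimates for the same states and actions
    """
    # Positive gap: critic underestimates, so reduce the pessimism knob.
    # Negative gap: critic overestimates, so increase it.
    gap = float(np.mean(np.asarray(rollout_returns) - np.asarray(q_estimates)))
    beta = beta - step_size * gap
    return float(np.clip(beta, beta_min, beta_max))

if __name__ == "__main__":
    beta = 0.5
    # Toy numbers: the critic currently overestimates by roughly 2.
    returns = [10.0, 12.0, 9.0]
    estimates = [12.5, 13.5, 11.0]
    print(f"updated beta: {acc_update(beta, returns, estimates):.3f}")
```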