Approximation of the value functions in value-based deep reinforcement learning systems induces overestimation bias, resulting in suboptimal policies. We show that when the reinforcement signals received by the agents have high variance, deep actor-critic approaches that overcome the overestimation bias lead to a substantial underestimation bias. We introduce a novel, parameter-free deep Q-learning variant that reduces this underestimation bias in continuous control. By computing the critic objective as a linear combination of the approximate critic functions with fixed weights, our Q-value update rule integrates the concepts of Clipped Double Q-learning and Maxmin Q-learning. We evaluate our modification on a set of MuJoCo and Box2D continuous control tasks and find that it improves on the state of the art and outperforms the baseline algorithms in the majority of the environments.
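To make the idea of a fixed-weight combination of critic estimates concrete, the following is a minimal sketch of how such a critic target could be formed, in the spirit of blending a Clipped Double Q-learning style minimum with a less pessimistic estimate. The weight `w`, the two-critic setup, and all names are illustrative assumptions for exposition, not the exact update rule of the paper.

```python
import numpy as np

def combined_critic_target(rewards, next_q1, next_q2, dones, gamma=0.99, w=0.5):
    """Illustrative critic target: a fixed-weight linear combination of two
    bias-correction strategies (assumed weights, not the paper's exact rule).

    - Clipped term: element-wise minimum of the two critics, as in Clipped
      Double Q-learning (pessimistic, counters overestimation).
    - Averaged term: mean of the two critics (milder, counters the
      underestimation introduced by always taking the minimum).
    """
    clipped = np.minimum(next_q1, next_q2)      # Clipped Double Q-style estimate
    averaged = 0.5 * (next_q1 + next_q2)        # less pessimistic estimate
    next_value = w * clipped + (1.0 - w) * averaged
    return rewards + gamma * (1.0 - dones) * next_value

# Toy usage with random critic outputs for a batch of 4 transitions.
rng = np.random.default_rng(0)
r = rng.normal(size=4)
q1, q2 = rng.normal(size=4), rng.normal(size=4)
d = np.zeros(4)
print(combined_critic_target(r, q1, q2, d))
```

Because the weight is fixed rather than tuned per task, the combination stays parameter-free in the sense described above; the sketch uses only two critics, whereas a Maxmin-style ensemble would take the minimum over a larger set.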