Parameter-Free Deterministic Reduction of the Estimation Bias in Continuous Control

Approximation of the value functions in value-based deep reinforcement learning systems induces overestimation bias, resulting in suboptimal policies. We show that when the reinforcement signals received by the agents have a high variance, deep actor-critic approaches that overcome the overestimation bias lead to a substantial underestimation bias. We introduce a novel, parameter-free deep Q-learning variant to reduce this underestimation bias for continuous control. By obtaining fixed weights in computing the critic objective as a linear combination of the approximate critic functions, our Q-value update rule integrates the concepts of Clipped Double Q-learning and Maxmin Q-learning. We test the performance of our improvement on a set of MuJoCo and Box2D continuous control tasks and find that it improves on the state of the art and outperforms the baseline algorithms in the majority of the environments.
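
The abstract describes the critic target as a fixed-weight linear combination of several critic estimates. A minimal sketch of that general idea is given below. It is not the paper's actual update rule: the function name `weighted_critic_target`, the choice to apply the weights to the critic estimates sorted in ascending order, and the example weights are all assumptions made for illustration, since the abstract does not state the weights.

```python
from typing import Sequence


def weighted_critic_target(
    q_values: Sequence[float],
    weights: Sequence[float],
    reward: float,
    done: bool,
    gamma: float = 0.99,
) -> float:
    """Bootstrapped target built from a fixed-weight combination of critics.

    q_values: next-state estimates from each critic for one transition.
    weights:  fixed weights applied to the estimates sorted in ascending
              order (hypothetical scheme, not taken from the paper).
    """
    if len(q_values) != len(weights):
        raise ValueError("need exactly one weight per critic estimate")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")

    # Sort so that weights[0] always multiplies the most pessimistic estimate.
    ordered = sorted(q_values)
    bootstrap = sum(w * q for w, q in zip(weights, ordered))

    # No bootstrapping from terminal states.
    return reward + gamma * (1.0 - float(done)) * bootstrap


# All weight on the smallest of two critics: the Clipped Double Q-learning target.
clipped = weighted_critic_target([2.0, 1.0], [1.0, 0.0], reward=0.0, done=False, gamma=0.9)

# All weight on the smallest of three critics: a Maxmin-style minimum over critics.
maxmin = weighted_critic_target([3.0, 1.0, 2.0], [1.0, 0.0, 0.0], reward=0.0, done=False, gamma=0.9)

# Illustrative weights that spread mass beyond the minimum, giving a less pessimistic target.
blended = weighted_critic_target([3.0, 1.0, 2.0], [0.5, 0.5, 0.0], reward=0.0, done=False, gamma=0.9)
```

Under this assumed scheme, putting all weight on the smallest estimate recovers the pessimistic minimum used by Clipped Double Q-learning and by Maxmin-style targets. Moving some weight to larger estimates raises the target, which is one way a fixed combination could counteract underestimation.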

