It is vital to accurately estimate the value function in Deep Reinforcement Learning (DRL) so that the agent executes proper actions rather than suboptimal ones. However, existing actor-critic methods suffer to varying degrees from underestimation or overestimation bias, which negatively affects their performance. In this paper, we reveal a simple but effective principle: proper value correction benefits bias alleviation. To this end, we propose the generalized-activated weighting operator, which uses any non-decreasing function, termed the activation function, as a weight for better value estimation. In particular, we integrate the generalized-activated weighting operator into value estimation and introduce a novel algorithm, Generalized-activated Deep Double Deterministic Policy Gradients (GD3). We theoretically show that GD3 is capable of alleviating the potential estimation bias. Interestingly, we find that simple activation functions yield satisfactory performance without additional tricks and can contribute to faster convergence. Experimental results on numerous challenging continuous control tasks show that GD3 with a task-specific activation outperforms common baseline methods. We also find that fine-tuning the polynomial activation function achieves superior results on most tasks.
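The following is a minimal sketch of what such a generalized-activated weighting could look like, assuming the operator aggregates Q-value samples with weights given by a non-decreasing activation g (as the abstract describes); the function name, the exponential and polynomial activation choices, and the sampled Q values are illustrative, not the paper's exact formulation.

```python
import numpy as np

def generalized_activated_value(q_values, activation=np.exp):
    """Aggregate Q-value estimates weighted by a non-decreasing activation g.

    With g = exp this reduces to softmax-style weighting; a constant g gives
    the plain mean; a steeper g pushes the estimate toward max Q. The
    activation should be non-decreasing (and non-negative) over the range
    of the Q values so the weights are valid.
    """
    weights = activation(q_values)
    return np.sum(weights * q_values) / np.sum(weights)

# Hypothetical Q estimates at a few sampled actions for one state.
q_samples = np.array([1.0, 2.0, 3.5])
print(generalized_activated_value(q_samples))                    # softmax-style weighting
print(generalized_activated_value(q_samples, lambda q: q ** 2))  # polynomial activation (valid here since Q > 0)
```

Varying the activation interpolates between mean-like (underestimation-prone) and max-like (overestimation-prone) estimates, which is how a task-specific choice can trade off the two biases.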