Reinforcement learning methods for robotics are increasingly successful due to the constant development of better policy gradient techniques. A precise (low variance) and accurate (low bias) gradient estimator is crucial for tackling increasingly complex tasks. Traditional policy gradient algorithms use the likelihood-ratio trick, which is known to produce unbiased but high-variance estimates. More modern approaches exploit the reparametrization trick, which gives lower-variance gradient estimates but requires differentiable value function approximators. In this work, we study a different type of stochastic gradient estimator: the Measure-Valued Derivative. This estimator is unbiased, has low variance, and can be used with both differentiable and non-differentiable function approximators. We empirically evaluate this estimator in the actor-critic policy gradient setting and show that it reaches performance comparable to methods based on the likelihood-ratio or reparametrization tricks, in both low- and high-dimensional action spaces. With this work, we want to show that the Measure-Valued Derivative estimator can be a useful alternative to other policy gradient estimators.
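As an illustration of the three estimator families named above, the following is a minimal sketch (not the paper's algorithm) comparing likelihood-ratio, reparametrization, and Measure-Valued Derivative estimates of the gradient of E_{x~N(mu, sigma^2)}[f(x)] with respect to the mean, assuming a univariate Gaussian "policy" and the stand-in objective f(x) = x^2; all identifiers here are illustrative, and for the Gaussian mean the MVD decomposes into a difference of expectations under Rayleigh-shifted distributions.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n = 1.5, 0.8, 200_000
f = lambda x: x ** 2                      # true gradient: d/dmu E[x^2] = 2*mu

# 1) Likelihood-ratio (score-function) estimator: unbiased, high variance.
x = rng.normal(mu, sigma, n)
score = (x - mu) / sigma ** 2             # d/dmu log N(x; mu, sigma^2)
g_lr = np.mean(f(x) * score)

# 2) Reparametrization estimator: x = mu + sigma*eps, requires df/dx.
eps = rng.normal(0.0, 1.0, n)
df_dx = 2.0 * (mu + sigma * eps)          # analytic derivative of f
g_rep = np.mean(df_dx)                    # dx/dmu = 1

# 3) Measure-Valued Derivative: difference of expectations under the
#    positive/negative decomposition of d/dmu N(x; mu, sigma^2).
#    For the Gaussian mean, the two components are mu +/- a Rayleigh(sigma)
#    sample with constant c = 1 / (sigma * sqrt(2*pi)); f need not be
#    differentiable for this estimator.
w = rng.rayleigh(scale=sigma, size=n)
c = 1.0 / (sigma * np.sqrt(2.0 * np.pi))
g_mvd = c * np.mean(f(mu + w) - f(mu - w))

print(f"true {2 * mu:.3f} | LR {g_lr:.3f} | reparam {g_rep:.3f} | MVD {g_mvd:.3f}")
```

All three estimates should concentrate around the analytic value 2*mu, with the likelihood-ratio estimate typically showing the largest spread across seeds; the MVD estimate matches it without ever differentiating f.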