Reinforcement learning methods for robotics are increasingly successful due to the constant development of better policy gradient techniques. A precise (low-variance) and accurate (low-bias) gradient estimator is crucial to tackle increasingly complex tasks. Traditional policy gradient algorithms use the likelihood-ratio trick, which is known to produce unbiased but high-variance estimates. More modern approaches exploit the reparametrization trick, which gives lower-variance gradient estimates but requires differentiable value function approximators. In this work, we study a different type of stochastic gradient estimator: the Measure-Valued Derivative. This estimator is unbiased, has low variance, and can be used with both differentiable and non-differentiable function approximators. We empirically evaluate this estimator in the actor-critic policy gradient setting and show that it can reach performance comparable to that of methods based on the likelihood-ratio or reparametrization tricks, in both low- and high-dimensional action spaces.
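To make the comparison of estimators concrete, below is a minimal NumPy sketch (not taken from the paper) of the three single-sample gradient estimators for the mean of a univariate Gaussian: the likelihood-ratio (score-function) estimator, the reparametrization estimator, and the Measure-Valued Derivative. The quadratic objective `f`, the parameters `mu` and `sigma`, and the sample size are illustrative assumptions; the MVD decomposition used here (a difference of evaluations at positive and negative samples built from a Rayleigh variate, scaled by 1/(σ√(2π))) is the standard one for the Gaussian mean.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 0.5, 1.0            # illustrative Gaussian "policy" parameters
f  = lambda x: (x - 2.0) ** 2   # toy objective; grad of E[f] w.r.t. mu is 2*(mu - 2)
df = lambda x: 2.0 * (x - 2.0)  # derivative of f, needed only by reparametrization
n  = 100_000

# Likelihood-ratio (score-function) estimator: unbiased, typically high variance,
# does not require the derivative of f.
x  = rng.normal(mu, sigma, n)
lr = f(x) * (x - mu) / sigma**2

# Reparametrization estimator: lower variance, but requires a differentiable f.
eps = rng.normal(0.0, 1.0, n)
rp  = df(mu + sigma * eps)

# Measure-Valued Derivative for the Gaussian mean: difference of f evaluated at a
# positive and a negative sample generated from a coupled Rayleigh(1) variate,
# scaled by 1 / (sigma * sqrt(2*pi)); no derivative of f is required.
r   = rng.rayleigh(1.0, n)
mvd = (f(mu + sigma * r) - f(mu - sigma * r)) / (sigma * np.sqrt(2.0 * np.pi))

print("true gradient:", 2 * (mu - 2.0))
print("LR   mean/std:", lr.mean(), lr.std())
print("RP   mean/std:", rp.mean(), rp.std())
print("MVD  mean/std:", mvd.mean(), mvd.std())
```

Running this sketch, all three sample means should agree with the analytic gradient 2(mu − 2) = −3, while the printed standard deviations illustrate the variance gap between the likelihood-ratio estimator and the other two.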