Many popular policy gradient methods for reinforcement learning follow a biased approximation of the policy gradient known as the discounted approximation. While it has been shown that the discounted approximation of the policy gradient is not the gradient of any objective function, little else is known about its convergence behavior or properties. In this paper, we show that if the discounted approximation is followed while the discount factor is slowly increased toward 1, at a rate tied to a decreasing learning rate, the resulting method recovers the standard guarantees of gradient ascent on the undiscounted objective.
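A minimal sketch of the schedule coupling described above, in Python. This is an illustration only, not the paper's algorithm: the particular choices alpha_k = a0 / sqrt(k) and 1 - gamma_k = c * alpha_k are hypothetical, and the gradient estimate is a placeholder standing in for the discounted approximation of the policy gradient evaluated with discount gamma_k.

```python
import numpy as np

# Hypothetical illustration of the idea in the abstract: follow the biased
# "discounted approximation" of the policy gradient while the discount factor
# gamma_k increases toward 1 at a rate tied to a decreasing learning rate
# alpha_k.  The concrete rates below are assumptions for illustration; the
# paper's rate condition may differ.

def coupled_schedules(num_iters, a0=0.1, c=1.0):
    """Yield (alpha_k, gamma_k) pairs with gamma_k -> 1 as alpha_k -> 0."""
    for k in range(1, num_iters + 1):
        alpha_k = a0 / np.sqrt(k)               # decreasing learning rate
        gamma_k = max(0.0, 1.0 - c * alpha_k)   # discount factor rising toward 1
        yield alpha_k, gamma_k

def ascent_step(theta, grad_estimate, alpha_k):
    """One gradient-ascent step along the (biased) gradient estimate."""
    return theta + alpha_k * grad_estimate

if __name__ == "__main__":
    theta = np.zeros(4)
    for alpha_k, gamma_k in coupled_schedules(num_iters=5):
        # In practice, grad_estimate would be the discounted approximation of
        # the policy gradient computed with discount gamma_k; a placeholder here.
        grad_estimate = np.ones_like(theta)
        theta = ascent_step(theta, grad_estimate, alpha_k)
        print(f"k-th step: alpha={alpha_k:.4f}, gamma={gamma_k:.4f}")
```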