Markov Decision Processes are classically solved using Value Iteration and Policy Iteration algorithms. Recent interest in Reinforcement Learning has motivated the study of methods inspired by optimization, such as gradient ascent. Among these, a popular algorithm is the Natural Policy Gradient, which is a mirror descent variant for MDPs. This algorithm forms the basis of several popular Reinforcement Learning algorithms such as Natural actor-critic, TRPO, and PPO, and is therefore being studied with growing interest. It has been shown that Natural Policy Gradient with constant step size converges to the global optimum at a sublinear rate of O(1/k). In this paper, we present improved finite-time convergence bounds and show that this algorithm has a geometric (also known as linear) asymptotic convergence rate. We further improve this convergence result by introducing a variant of Natural Policy Gradient with adaptive step sizes. Finally, we compare different variants of policy gradient methods experimentally.
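To make the constant-step-size setting concrete, the sketch below implements tabular Natural Policy Gradient with a softmax policy, where the mirror-descent view reduces the update to the multiplicative rule pi_{k+1}(a|s) ∝ pi_k(a|s) exp(eta * Q^{pi_k}(s,a)). This is a minimal illustration, not the paper's code: the toy randomly generated MDP and the helper names (policy_evaluation, npg_constant_step, eta, iters) are our own choices for exposition.

```python
# Minimal sketch of tabular Natural Policy Gradient with constant step size.
# The MDP (P, R, gamma) is a small randomly generated example for illustration.
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 5, 3, 0.9

# Random transition kernel P[s, a, s'] and reward R[s, a].
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)
R = rng.random((n_states, n_actions))


def policy_evaluation(pi):
    """Exact evaluation: solve (I - gamma * P_pi) V = r_pi, then Q = R + gamma * P V."""
    P_pi = np.einsum("sap,sa->sp", P, pi)   # state-to-state kernel under pi
    r_pi = np.einsum("sa,sa->s", R, pi)     # expected one-step reward under pi
    V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
    Q = R + gamma * P @ V
    return V, Q


def npg_constant_step(eta=1.0, iters=200):
    """NPG with constant step size eta: multiplicative (exponentiated-gradient) update."""
    pi = np.full((n_states, n_actions), 1.0 / n_actions)  # uniform initial policy
    for _ in range(iters):
        _, Q = policy_evaluation(pi)
        pi = pi * np.exp(eta * Q)            # pi_{k+1}(a|s) proportional to pi_k(a|s) exp(eta Q(s,a))
        pi /= pi.sum(axis=1, keepdims=True)  # renormalize each row to a distribution
    return pi


if __name__ == "__main__":
    pi = npg_constant_step()
    V, _ = policy_evaluation(pi)
    print("state values under the NPG policy:", np.round(V, 3))
```

The adaptive-step-size variant studied in the paper would replace the fixed eta above with an iteration-dependent choice; we do not reproduce that rule here.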