Natural policy gradient (NPG) methods are among the most widely used policy optimization algorithms in contemporary reinforcement learning. This class of methods is often applied in conjunction with entropy regularization -- an algorithmic scheme that encourages exploration -- and is closely related to soft policy iteration and trust region policy optimization. Despite the empirical success, the theoretical underpinnings for NPG methods remain limited even for the tabular setting. This paper develops $\textit{non-asymptotic}$ convergence guarantees for entropy-regularized NPG methods under softmax parameterization, focusing on discounted Markov decision processes (MDPs). Assuming access to exact policy evaluation, we demonstrate that the algorithm converges linearly -- or even quadratically once it enters a local region around the optimal policy -- when computing optimal value functions of the regularized MDP. Moreover, the algorithm is provably stable vis-\`a-vis inexactness of policy evaluation. Our convergence results accommodate a wide range of learning rates, and shed light upon the role of entropy regularization in enabling fast convergence.
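For concreteness, a common formulation of the setup described above (stated here only as an illustration; the paper's exact definitions and notation may differ) augments the value function with the policy's Shannon entropy,
\[
V_\tau^{\pi}(s) \;=\; \mathbb{E}\Big[\sum_{t=0}^{\infty}\gamma^t\big(r(s_t,a_t) + \tau\,\mathcal{H}\big(\pi(\cdot\,|\,s_t)\big)\big)\,\Big|\, s_0 = s\Big],
\]
with discount factor $\gamma$ and regularization parameter $\tau>0$. Under softmax parameterization, the entropy-regularized NPG update with learning rate $\eta$ then takes a multiplicative, soft-policy-iteration-style form,
\[
\pi^{(t+1)}(a\,|\,s) \;\propto\; \big(\pi^{(t)}(a\,|\,s)\big)^{1-\frac{\eta\tau}{1-\gamma}} \exp\!\Big(\tfrac{\eta}{1-\gamma}\,Q_\tau^{(t)}(s,a)\Big),
\]
where $Q_\tau^{(t)}$ denotes the soft Q-function of the current policy $\pi^{(t)}$.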