Policy gradient algorithms have been widely applied to Markov decision processes and reinforcement learning problems in recent years. Regularization with various entropy functions is often used to encourage exploration and improve stability. This paper proposes an approximate Newton method for the policy gradient algorithm with entropy regularization. In the case of Shannon entropy, the resulting algorithm reproduces the natural policy gradient algorithm. For other entropy functions, the method yields new policy gradient algorithms. We prove that all these algorithms enjoy Newton-type quadratic convergence and that the corresponding gradient flow converges globally to the optimal solution. We use synthetic and industrial-scale examples to demonstrate that the proposed approximate Newton method typically converges in single-digit iterations, often orders of magnitude faster than other state-of-the-art algorithms.
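For orientation, the following is a minimal sketch of the standard entropy-regularized setting the abstract refers to, written in generic notation (the symbols $\tau$, $H$, $\widehat{F}$, and the precise form of the approximate Newton step are assumptions here, not taken from the paper's own derivation):

```latex
\documentclass{article}
\usepackage{amsmath}
\begin{document}

% Entropy-regularized objective with Shannon entropy H and temperature tau > 0
% (standard definition; notation is generic, not this paper's).
\[
V_{\tau}^{\pi}(\mu)
  \;=\;
  \mathbb{E}_{s_0 \sim \mu}\!\left[
    \sum_{t \ge 0} \gamma^{t}
      \bigl( r(s_t, a_t) + \tau\, H\!\left(\pi(\cdot \mid s_t)\right) \bigr)
  \right],
\qquad
H(p) \;=\; -\sum_{a} p(a)\,\log p(a).
\]

% A Newton-type step replaces the Hessian of the objective by an approximation
% \widehat{F}(\theta); when \widehat{F} is the Fisher information matrix of the
% policy, this reduces to the familiar natural policy gradient update, which is
% consistent with the Shannon-entropy case described in the abstract.
\[
\theta_{k+1}
  \;=\;
  \theta_{k}
  \;+\;
  \eta\, \widehat{F}(\theta_{k})^{\dagger}\,
  \nabla_{\theta} V_{\tau}^{\pi_{\theta_{k}}}(\mu).
\]

\end{document}
```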