Natural policy gradient (NPG) methods with entropy regularization achieve impressive empirical success in reinforcement learning problems with large state-action spaces. However, their convergence properties and the impact of entropy regularization remain elusive in the function approximation regime. In this paper, we establish finite-time convergence analyses of entropy-regularized NPG with linear function approximation under softmax parameterization. In particular, we prove that entropy-regularized NPG with averaging satisfies the \emph{persistence of excitation} condition, and achieves a fast convergence rate of $\tilde{O}(1/T)$ up to a function approximation error in regularized Markov decision processes. This convergence result does not require any a priori assumptions on the policies. Furthermore, under mild regularity conditions on the concentrability coefficient and basis vectors, we prove that entropy-regularized NPG exhibits \emph{linear convergence} up to a function approximation error.
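For concreteness, the display below sketches the standard entropy-regularized objective and the softmax NPG update with linear function approximation; the notation ($\phi$ for the feature map, $\theta$ for the policy parameter, $\tau$ for the regularization weight, $\eta$ for the step size, $d_k$ for the sampling distribution) is illustrative and not taken from the paper, and the regression target may be the soft $Q$-function or the soft advantage depending on the variant.
\begin{align*}
&\text{Regularized value:} && V_\tau^{\pi}(\rho) = \mathbb{E}_{\pi}\Big[\sum_{t=0}^{\infty}\gamma^{t}\big(r(s_t,a_t) - \tau \log \pi(a_t \mid s_t)\big) \,\Big|\, s_0 \sim \rho\Big],\\
&\text{Log-linear policy:} && \pi_{\theta}(a \mid s) \propto \exp\big(\phi(s,a)^{\top}\theta\big),\\
&\text{NPG update:} && \theta_{k+1} = \theta_k + \eta\, w_k, \qquad
w_k \in \operatorname*{arg\,min}_{w}\; \mathbb{E}_{(s,a)\sim d_k}\Big[\big(\phi(s,a)^{\top} w - Q_{\tau}^{\pi_{\theta_k}}(s,a)\big)^{2}\Big].
\end{align*}
In the tabular softmax case this update is known to reduce to a multiplicative soft policy iteration of the form $\pi_{k+1}(a \mid s) \propto \pi_k(a \mid s)^{1-\alpha} \exp\big(\alpha\, Q_\tau^{\pi_k}(s,a)/\tau\big)$, with an effective step size $\alpha \in (0,1]$ determined by $\eta$, $\tau$, and the discount factor, which is typically the source of the contraction that yields linear convergence in the tabular setting.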