We propose a new policy gradient method, named homotopic policy mirror descent (HPMD), for solving discounted, infinite-horizon MDPs with finite state and action spaces. HPMD performs a mirror descent-type policy update with an additional diminishing regularization term, and possesses several computational properties that appear to be new in the literature. We first establish the global linear convergence of HPMD instantiated with the Kullback-Leibler divergence, for both the optimality gap and a weighted distance to the set of optimal policies. Local superlinear convergence is then obtained for both quantities without any assumption. Combining local acceleration with diminishing regularization, we establish the first result among policy gradient methods that certifies and characterizes the limiting policy, by showing, with a non-asymptotic characterization, that the last-iterate policy converges to the unique optimal policy with maximal entropy. We then extend all the aforementioned results to HPMD instantiated with a broad class of decomposable Bregman divergences, demonstrating the generality of these computational properties. As a by-product, we discover finite-time exact convergence for some commonly used Bregman divergences, implying that HPMD keeps converging to the limiting policy even after the current policy becomes optimal. Finally, we develop a stochastic version of HPMD and establish similar convergence properties. By exploiting the local acceleration, we show that for a small optimality gap, a sample complexity better than $\tilde{\mathcal{O}}(\left|\mathcal{S}\right| \left|\mathcal{A}\right| / \epsilon^2)$ holds with high probability, when assuming a generative model for policy evaluation.
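The abstract does not spell out the update rule; the following is a minimal, hypothetical sketch of a KL-instantiated mirror-descent step with an entropy regularizer weighted by a diminishing coefficient, in the spirit of the description above. The stepsize $\eta_k$, regularization weight $\tau_k$, the Q-function values, and the cost-minimization convention are all assumptions made for illustration, not the paper's exact formulation.

```python
import numpy as np

def hpmd_kl_update(pi_s, q_s, eta, tau):
    """Sketch of one mirror-descent-type policy update at a single state s,
    instantiated with the KL divergence (negative-entropy distance-generating
    function) and a diminishing regularization weight tau.

    pi_s : current policy pi_k(.|s), a probability vector over actions
    q_s  : Q-values Q^{pi_k}(s, .) (cost-minimization convention assumed;
           flip the sign for reward maximization)
    eta  : stepsize eta_k
    tau  : regularization weight tau_k

    Returns the closed-form minimizer over the simplex of
        eta * (<q_s, p> + tau * <p, log p>) + KL(p || pi_s),
    namely  p(a) ~ pi_s(a)^{1/(1+eta*tau)} * exp(-eta * q_s(a) / (1+eta*tau)).
    """
    scale = 1.0 / (1.0 + eta * tau)
    # work in log-space for numerical stability
    logits = scale * (np.log(pi_s) - eta * q_s)
    logits -= logits.max()  # guard against overflow in exp
    p = np.exp(logits)
    return p / p.sum()

# toy usage: 3 actions, uniform starting policy
pi_s = np.ones(3) / 3
q_s = np.array([1.0, 0.5, 2.0])
print(hpmd_kl_update(pi_s, q_s, eta=0.5, tau=0.1))
```

As $\tau_k$ diminishes, the entropy term vanishes from the objective while still biasing the iterates toward higher-entropy policies, which is consistent with the maximal-entropy limiting policy highlighted above.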