We address the problem of finding the optimal policy of a constrained Markov decision process (CMDP) using a gradient descent-based algorithm. Previous results have shown that a primal-dual approach can achieve an $\mathcal{O}(1/\sqrt{T})$ global convergence rate for both the optimality gap and the constraint violation. We propose a new algorithm, policy mirror descent-primal dual (PMD-PD), which provably achieves a faster $\mathcal{O}(\log(T)/T)$ convergence rate for both the optimality gap and the constraint violation. For the primal (policy) update, the PMD-PD algorithm uses a modified value function and performs natural policy gradient steps, which are equivalent to mirror descent steps with appropriate regularization. For the dual update, the PMD-PD algorithm uses modified Lagrange multipliers to ensure the faster convergence rate. We also present two extensions of this approach, to the setting with zero constraint violation and to sample-based estimation. Experimental results demonstrate the faster convergence rate and the better performance of the PMD-PD algorithm compared with existing policy gradient-based algorithms.
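To make the primal-dual structure described above concrete, the following is a minimal illustrative sketch in Python of a generic primal-dual natural policy gradient loop on a small randomly generated tabular CMDP. It is not the PMD-PD algorithm itself: the paper's modified value function and modified Lagrange multipliers are replaced here by a plain Lagrangian payoff and a plain projected dual update, and all quantities (the random CMDP, step sizes `eta`, `eta_dual`, and budget `b`) are illustrative assumptions. It only shows how a KL-regularized (multiplicative-weights) policy update and a dual ascent step on the constraint fit together.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- A small random tabular CMDP (illustrative only) ---
S, A, gamma = 5, 3, 0.9
P = rng.dirichlet(np.ones(S), size=(S, A))      # transition kernel P[s, a, s']
r = rng.uniform(0, 1, size=(S, A))              # reward
c = rng.uniform(0, 1, size=(S, A))              # constraint cost
b = 0.5 / (1 - gamma)                           # cost budget (assumed)
rho = np.ones(S) / S                            # initial-state distribution

def q_values(pi, payoff):
    """Exact policy evaluation: Q and V of `payoff` under policy `pi`."""
    r_pi = np.einsum('sa,sa->s', pi, payoff)
    P_pi = np.einsum('sa,sat->st', pi, P)
    v = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
    return payoff + gamma * P @ v, v            # (Q[s, a], V[s])

pi = np.ones((S, A)) / A                        # uniform initial policy
lam, eta, eta_dual = 0.0, 1.0, 0.1              # multiplier and step sizes

for t in range(200):
    # Primal step: natural policy gradient on the Lagrangian payoff r - lam*c.
    # For softmax policies this equals a KL-regularized mirror descent
    # (multiplicative-weights) update, mirroring the PMD-PD primal structure.
    q_lag, _ = q_values(pi, r - lam * c)
    pi = pi * np.exp(eta * q_lag)
    pi /= pi.sum(axis=1, keepdims=True)

    # Dual step: projected ascent on the constraint violation. The paper's
    # modified multipliers are simplified to a plain projected update here.
    _, v_c = q_values(pi, c)
    lam = max(0.0, lam + eta_dual * (rho @ v_c - b))

_, v_r = q_values(pi, r)
_, v_c = q_values(pi, c)
print(f"return={rho @ v_r:.3f}  cost={rho @ v_c:.3f}  budget={b:.3f}  lambda={lam:.3f}")
```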