We study sequential decision-making problems that aim to maximize the expected total reward while satisfying a constraint on the expected total utility. We employ the natural policy gradient method to solve the discounted infinite-horizon optimal control problem for constrained Markov decision processes (constrained MDPs). Specifically, we propose a new Natural Policy Gradient Primal-Dual (NPG-PD) method that updates the primal variable via natural policy gradient ascent and the dual variable via projected sub-gradient descent. Although the underlying maximization involves a nonconcave objective function and a nonconvex constraint set, under the softmax policy parametrization we prove that our method achieves global convergence at sublinear rates with respect to both the optimality gap and the constraint violation. This convergence is independent of the size of the state-action space, i.e., it is dimension-free. Furthermore, for log-linear and general smooth policy parametrizations, we establish sublinear convergence rates up to a function approximation error caused by the restricted policy parametrization. We also provide convergence and finite-sample complexity guarantees for two sample-based NPG-PD algorithms. Finally, we use computational experiments to showcase the merits and effectiveness of our approach.
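To make the primal-dual update scheme concrete, the LaTeX sketch below illustrates one possible form of the NPG-PD iteration on a Lagrangian $L(\theta,\lambda) = V_r^{\pi_\theta}(\rho) + \lambda\,\big(V_g^{\pi_\theta}(\rho) - b\big)$; the Fisher information matrix $F_\rho(\theta)$, utility threshold $b$, dual interval $\Lambda$, and step sizes $\eta_1, \eta_2$ are assumed notation for illustration and are not defined in this abstract.

% Schematic NPG-PD iteration (assumed notation): primal natural policy gradient
% ascent on the Lagrangian, dual projected sub-gradient descent on the constraint.
\begin{align*}
  \theta^{(t+1)} &= \theta^{(t)} + \eta_1\, F_\rho\big(\theta^{(t)}\big)^{\dagger}\, \nabla_\theta L\big(\theta^{(t)}, \lambda^{(t)}\big)
  && \text{(natural policy gradient ascent on the primal variable)} \\
  \lambda^{(t+1)} &= \mathcal{P}_{\Lambda}\!\left[\lambda^{(t)} - \eta_2 \Big( V_g^{\pi_{\theta^{(t)}}}(\rho) - b \Big)\right]
  && \text{(projected sub-gradient descent on the dual variable)}
\end{align*}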