We consider the problem of constrained Markov decision processes (CMDPs) in continuous state-action spaces, where the goal is to maximize the expected cumulative reward subject to constraints. We propose a novel Conservative Natural Policy Gradient Primal-Dual Algorithm (C-NPG-PD) that achieves zero constraint violation while attaining state-of-the-art convergence results for the objective value function. For general policy parametrization, we prove convergence of the value function to the global optimum up to an approximation error due to the restricted policy class. We also improve the sample complexity of the existing constrained NPG-PD algorithm \cite{Ding2020} from $\mathcal{O}(1/\epsilon^6)$ to $\mathcal{O}(1/\epsilon^4)$. To the best of our knowledge, this is the first work to establish zero constraint violation with natural policy gradient-style algorithms for infinite-horizon discounted CMDPs. We demonstrate the merits of the proposed algorithm via experimental evaluations.
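As a hedged illustration (not necessarily the exact update analyzed in this work), a conservative natural policy gradient primal-dual iteration of this flavor alternates a Fisher-preconditioned ascent step on the Lagrangian with a projected dual descent step on a tightened constraint; the symbols $F_\rho(\theta)$, $\lambda_t$, $\eta_1$, $\eta_2$, $b$, $\Lambda$, and $\kappa$ below are assumed notation:
\[
\theta_{t+1} = \theta_t + \eta_1\, F_\rho(\theta_t)^{\dagger}\, \nabla_\theta \big[ V_r(\theta_t) + \lambda_t V_g(\theta_t) \big],
\qquad
\lambda_{t+1} = \mathcal{P}_{[0,\Lambda]}\big[ \lambda_t - \eta_2 \big( V_g(\theta_t) - b - \kappa \big) \big],
\]
where $F_\rho(\theta)$ denotes the Fisher information matrix under the initial distribution $\rho$, $V_r$ and $V_g$ the reward and constraint value functions, $\lambda_t$ the dual variable projected onto $[0,\Lambda]$, $b$ the constraint threshold, and $\kappa > 0$ the conservative tightening that drives the constraint violation to zero.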