The problem of a constrained Markov decision process (CMDP) is considered: an agent aims to maximize the expected accumulated discounted reward subject to multiple constraints on its costs (the number of constraints is assumed to be relatively small). A new dual approach is proposed that integrates two ingredients: an entropy-regularized policy optimizer and Vaidya's dual optimizer, both of which are critical for achieving faster convergence. A finite-time error bound for the proposed approach is provided. Despite the challenge of a nonconcave objective subject to nonconcave constraints, the proposed approach is shown to converge to the global optimum at a linear rate. The resulting complexity, expressed in terms of the optimality gap and the constraint violation, significantly improves upon existing primal-dual approaches.
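To make the dual structure described above concrete, the following is a minimal, hypothetical sketch on a randomly generated tabular CMDP with a single cost constraint: for each value of the dual variable, the inner problem is an entropy-regularized MDP solved by soft value iteration, and an outer loop updates the dual variable. For brevity, the sketch substitutes projected dual subgradient ascent for Vaidya's cutting-plane method; all dimensions, data, and variable names are illustrative and not taken from the paper.

```python
import numpy as np

# Hypothetical tabular CMDP: S states, A actions, reward r[s, a],
# one cost c[s, a] with budget b. All values below are illustrative.
S, A, gamma, tau = 4, 2, 0.9, 0.1            # tau: entropy-regularization weight
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(S), size=(S, A))   # transitions: P[s, a] is a dist over s'
r = rng.uniform(size=(S, A))                 # reward function
c = rng.uniform(size=(S, A))                 # cost function
b = 2.0                                      # cost budget (constraint: J_c <= b)
rho = np.full(S, 1.0 / S)                    # initial-state distribution

def soft_value_iteration(payoff, iters=500):
    """Solve the entropy-regularized MDP for a fixed Lagrangian payoff;
    return the soft-optimal policy softmax(Q / tau)."""
    V = np.zeros(S)
    for _ in range(iters):
        Q = payoff + gamma * P @ V                       # Q[s, a]
        m = Q.max(axis=1)                                # stable soft Bellman backup:
        V = m + tau * np.log(np.exp((Q - m[:, None]) / tau).sum(axis=1))
    Q = payoff + gamma * P @ V
    pi = np.exp((Q - Q.max(axis=1, keepdims=True)) / tau)
    return pi / pi.sum(axis=1, keepdims=True)

def discounted_return(pi, payoff):
    """Exact policy evaluation of a payoff function under policy pi."""
    P_pi = np.einsum('sap,sa->sp', P, pi)                # state transition under pi
    r_pi = (pi * payoff).sum(axis=1)
    V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)  # (I - gamma P_pi) V = r_pi
    return rho @ V

# Outer dual loop. The paper uses Vaidya's cutting-plane method here;
# this sketch substitutes projected dual subgradient ascent for brevity.
lam, eta = 0.0, 0.5
for t in range(200):
    pi = soft_value_iteration(r - lam * c)        # inner entropy-regularized solve
    violation = discounted_return(pi, c) - b      # subgradient of the dual at lam
    lam = max(0.0, lam + eta * violation)         # raise lam if the budget is exceeded

print(f"lambda={lam:.3f}, reward={discounted_return(pi, r):.3f}, "
      f"cost={discounted_return(pi, c):.3f} (budget {b})")
```

The design mirrors the two-level decomposition in the abstract: entropy regularization makes the inner problem strongly concave in the policy (so the inner solve is fast and well-conditioned), while the outer method only needs dual-function values and subgradients, which is exactly the access model under which a cutting-plane scheme such as Vaidya's achieves its faster rate for a small number of constraints.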