In safe reinforcement learning (SRL) problems, an agent explores the environment to maximize an expected total reward while avoiding violations of certain constraints on a number of expected total costs. In general, such SRL problems have nonconvex objective functions subject to multiple nonconvex constraints, and hence are very challenging to solve, particularly when a globally optimal policy is required. Many popular SRL algorithms adopt a primal-dual structure, which relies on updating dual variables to satisfy the constraints. In contrast, we propose a primal approach, called constraint-rectified policy optimization (CRPO), which alternates policy updates between objective improvement and constraint satisfaction. CRPO provides a primal-type algorithmic framework for solving SRL problems, in which each policy update can take any variant of policy optimization step. To demonstrate the theoretical performance of CRPO, we adopt natural policy gradient (NPG) for each policy update step and show that CRPO achieves an $\mathcal{O}(1/\sqrt{T})$ convergence rate to the globally optimal policy in the constrained policy set and an $\mathcal{O}(1/\sqrt{T})$ error bound on constraint satisfaction. This is the first finite-time analysis of primal SRL algorithms with a global optimality guarantee. Our empirical results demonstrate that CRPO can significantly outperform existing primal-dual baseline algorithms.
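To make the alternation concrete, the following Python sketch illustrates one way the CRPO loop described above could be organized. It is a minimal sketch, not the authors' reference implementation: the callables `reward_step`, `cost_step`, and `estimate_costs`, the tolerance parameter `eta`, and all other names are illustrative assumptions; in the paper each policy update may be any policy optimization step, e.g., an NPG update.

```python
def crpo_loop(policy, reward_step, cost_step, estimate_costs, limits, eta, num_iters):
    """Hypothetical sketch of the CRPO alternation.

    reward_step(policy)    -> policy updated to increase the expected total reward
    cost_step(policy, i)   -> policy updated to decrease the i-th expected total cost
    estimate_costs(policy) -> list of estimated expected total costs
    limits                 -> list of constraint thresholds
    eta                    -> constraint-violation tolerance (assumed parameter)
    """
    for _ in range(num_iters):
        costs = estimate_costs(policy)
        # Constraints whose estimated cost exceeds its threshold by more than eta.
        violated = [i for i, (c, d) in enumerate(zip(costs, limits)) if c > d + eta]
        if not violated:
            # All constraints approximately satisfied: improve the objective.
            policy = reward_step(policy)
        else:
            # Otherwise, rectify one violated constraint by reducing its cost.
            policy = cost_step(policy, violated[0])
    return policy
```

The sketch highlights the primal character of the approach: no dual variables are maintained, and the choice between objective improvement and constraint rectification is made directly from the estimated costs at each iteration.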