Safe reinforcement learning aims to learn the optimal policy while satisfying safety constraints, which is essential in real-world applications. However, current algorithms still struggle to achieve efficient policy updates with hard constraint satisfaction. In this paper, we propose Penalized Proximal Policy Optimization (P3O), which solves the cumbersome constrained policy iteration via a single minimization of an equivalent unconstrained problem. Specifically, P3O utilizes a simple-yet-effective penalty function to eliminate the cost constraints and removes the trust-region constraint via the clipped surrogate objective. We theoretically prove the exactness of the proposed method with a finite penalty factor and provide a worst-case analysis of the approximation error when the objective is evaluated on sampled trajectories. Moreover, we extend P3O to the more challenging multi-constraint and multi-agent scenarios, which are less studied in previous work. Extensive experiments show that P3O outperforms state-of-the-art algorithms with respect to both reward improvement and constraint satisfaction on a set of constrained locomotion tasks.
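The following is a minimal sketch of what the penalized clipped objective described above might look like in practice. It is an illustrative assumption, not the authors' exact implementation: the function name p3o_loss, the hyperparameter values, and the way the constraint violation is estimated from the previous policy's cost estimate are all hypothetical placeholders chosen for clarity.

```python
import torch

def p3o_loss(ratio, adv_r, adv_c, cost_estimate_prev, cost_limit,
             clip_eps=0.2, kappa=20.0):
    """Sketch of a P3O-style penalized clipped objective (assumed form).

    ratio:              pi_theta(a|s) / pi_old(a|s) for sampled actions
    adv_r, adv_c:       reward and cost advantage estimates
    cost_estimate_prev: estimate of the expected cumulative cost under pi_old
    cost_limit:         the constraint threshold
    kappa:              finite penalty factor (hyperparameter)
    """
    clipped_ratio = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)

    # Clipped surrogate for the reward, as in PPO (to be maximized).
    surr_r = torch.min(ratio * adv_r, clipped_ratio * adv_r)

    # Clipped surrogate for the cost (pessimistic upper bound, so take the max).
    surr_c = torch.max(ratio * adv_c, clipped_ratio * adv_c)

    # Exact-penalty term: ReLU of the estimated constraint violation,
    # replacing the explicit cost constraint with a penalty in the objective.
    violation = cost_estimate_prev - cost_limit + surr_c.mean()
    penalty = torch.relu(violation)

    # Single unconstrained loss: maximize the reward surrogate while
    # penalizing constraint violation with a finite factor kappa.
    return -surr_r.mean() + kappa * penalty
```

Under this reading, the ReLU penalty leaves the objective unchanged whenever the constraint estimate is satisfied and grows linearly with the violation otherwise, which is what allows a finite penalty factor to recover the constrained solution.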