Safe reinforcement learning (RL) remains challenging because it requires the agent to balance return maximization with safe exploration. In this paper, we propose CUP, a Conservative Update Policy algorithm with a theoretical safety guarantee. We derive CUP from newly proposed performance bounds and surrogate functions. Although using bounds as surrogate functions to design safe RL algorithms has appeared in existing work, we improve on it in at least three aspects: (i) we provide a rigorous theoretical analysis that extends the surrogate functions to the generalized advantage estimator (GAE); GAE significantly reduces variance empirically while maintaining a tolerable level of bias, which is a key step in the design of CUP; (ii) the proposed bounds are tighter than those in existing work, i.e., using them as surrogate functions yields better local approximations to the objective and the safety constraints; (iii) CUP admits a non-convex implementation via first-order optimizers and does not rely on any convex approximation. Finally, extensive experiments show the effectiveness of CUP, with the agent satisfying the safety constraints. The source code of CUP is available at https://github.com/RL-boxes/Safe-RL.
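For reference, a minimal sketch of the generalized advantage estimator that the abstract refers to (the symbols $\gamma$, $\lambda$, $\delta_t$, and the value function $V$ are standard GAE notation and are not introduced in the abstract itself):
\[
\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t), \qquad
\hat{A}_t^{\mathrm{GAE}(\gamma,\lambda)} = \sum_{l=0}^{\infty} (\gamma\lambda)^l \, \delta_{t+l},
\]
where $\lambda \in [0,1]$ interpolates between the low-variance, high-bias one-step estimate ($\lambda = 0$) and the high-variance, low-bias Monte Carlo return ($\lambda = 1$), which is the variance-bias trade-off the abstract alludes to.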