Joint Synthesis of Safety Certificate and Safe Control Policy using Constrained Reinforcement Learning

Safety is a major consideration when controlling complex dynamical systems with reinforcement learning (RL), where a safety certificate can provide a provable safety guarantee. A valid safety certificate is an energy function that assigns low energy to safe states and for which there exists a corresponding safe control policy that always allows the energy to dissipate. The safety certificate and the safe control policy are closely related, and both are challenging to synthesize. Existing learning-based studies therefore treat one of them as prior knowledge in order to learn the other, which limits their applicability to systems with general unknown dynamics. This paper proposes a novel approach that simultaneously synthesizes the energy-function-based safety certificate and learns the safe control policy with constrained reinforcement learning (CRL). We do not rely on prior knowledge of either an available model-based controller or a perfect safety certificate. In particular, we formulate a loss function that optimizes the safety certificate parameters by minimizing the occurrence of energy increases. By adding this optimization procedure as an outer loop to Lagrangian-based CRL, we jointly update the policy and safety certificate parameters and prove that they converge to their respective local optima: the optimal safe policy and a valid safety certificate. We evaluate our algorithm on multiple safety-critical benchmark environments. The results show that the proposed algorithm learns provably safe policies with no constraint violation. The validity, or feasibility, of the synthesized safety certificate is also verified numerically.
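The idea of penalizing energy increases can be illustrated with a minimal sketch. It is not the paper's implementation: the toy energy function, the parameter names `k`, `eta`, and `d_min`, the hinge form of the loss, and the grid search standing in for a gradient-based outer-loop update are all assumptions made here for illustration.

```python
# Hypothetical sketch: every name and functional form below is an
# illustrative assumption, not taken from the paper.

D_MIN = 0.5  # assumed minimum safe distance to an obstacle


def energy(state, k):
    """Toy energy function: low when far from the obstacle and not closing in.

    state is (distance, closing_speed); k is the tunable certificate parameter.
    """
    distance, closing_speed = state
    return D_MIN - distance + k * closing_speed


def certificate_loss(transitions, k, eta=0.0):
    """Mean hinge penalty over transitions whose energy fails to dissipate.

    A transition (s, s_next) is penalized when energy(s_next) exceeds
    max(energy(s) - eta, 0), i.e. when the energy increases while unsafe.
    """
    total = 0.0
    for s, s_next in transitions:
        target = max(energy(s, k) - eta, 0.0)
        total += max(0.0, energy(s_next, k) - target)
    return total / len(transitions)


# One sampled transition under a braking policy: the agent is still
# approaching the obstacle, but its closing speed is dropping.
transitions = [((0.4, 1.0), (0.3, 0.5))]

# Outer-loop stand-in: pick the candidate parameter with the lowest loss.
candidates = [0.0, 0.5, 1.0]
best_k = min(candidates, key=lambda k: certificate_loss(transitions, k))
```

With `k = 0` the energy depends only on distance, so the approach toward the obstacle registers as an energy increase and is penalized. A positive `k` credits the braking action, and the same transition then counts as dissipating. In a joint scheme, an update of this kind would alternate with the policy and Lagrange multiplier updates of the CRL inner loop.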
