Training-time safety violations have been a major concern when deploying reinforcement learning algorithms in the real world. This paper explores the possibility of safe RL algorithms with zero training-time safety violations in the challenging setting where we are only given a safe but trivial-reward initial policy, without any prior knowledge of the dynamics model or additional offline data. We propose an algorithm, Co-trained Barrier Certificate for Safe RL (CRABS), which iteratively learns barrier certificates, dynamics models, and policies. The barrier certificates, learned via adversarial training, ensure the policy's safety assuming a calibrated learned dynamics model. We also add a regularization term to encourage larger certified regions and thereby enable better exploration. Empirical simulations show that achieving zero safety violations is already challenging for a suite of simple environments with only 2-4 dimensional state spaces, especially when high-reward policies must visit regions near the safety boundary. Prior methods require hundreds of violations to achieve decent rewards on these tasks, whereas our proposed algorithm incurs zero violations.
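To make concrete what the barrier certificates above guarantee, the following is one standard discrete-time formulation written in our own notation (a sketch of the general technique, not necessarily the exact conditions or sign conventions used in the paper): a function $h$ certifies the region $\mathcal{C} = \{\, s : h(s) \ge 0 \,\}$ if it contains the initial state, excludes all unsafe states, and is forward-invariant under every transition function consistent with the calibrated learned dynamics model,
\[
h(s_0) \ge 0, \qquad
h(s) < 0 \ \ \forall s \in \mathcal{S}_{\text{unsafe}}, \qquad
h\big(\hat{f}(s, \pi(s))\big) \ge 0 \ \ \forall s \in \mathcal{C},\ \forall \hat{f} \in \mathcal{F}_{\text{calibrated}}.
\]
Under these conditions, every trajectory of $\pi$ starting from $s_0$ remains inside $\mathcal{C}$ and hence never reaches an unsafe state, even during training; encouraging a larger certified region $\mathcal{C}$, as the regularization term does, lets the policy explore more of the state space while staying certified.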