We consider the safe reinforcement learning (RL) problem of maximizing utility while satisfying provided constraints. Since we do not assume any prior knowledge of, or pre-training on, the safety concept, we are interested in asymptotic constraint satisfaction. A popular approach in this line of research combines the Lagrangian method with a model-free RL algorithm to dynamically adjust the weight of the constraint reward. It relies on a single policy to handle the conflict between utility and constraint rewards, which is often challenging. Inspired by the safety layer design (Dalal et al., 2018), we propose to separately learn a safety editor policy that transforms potentially unsafe actions output by a utility maximizer policy into safe ones. The safety editor is trained to maximize the constraint reward while minimizing a hinge loss on the utility Q values of the actions before and after editing. On 12 custom Safety Gym (Ray et al., 2019) tasks and 2 safe racing tasks with very harsh constraint thresholds, our approach demonstrates outstanding utility performance while complying with the constraints. Ablation studies show that our two-policy design is critical: simply doubling the model capacity of typical single-policy approaches does not yield comparable results. The Q hinge loss is also important in certain circumstances; replacing it with the usual L2 distance can fail badly.
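A minimal sketch of the safety editor objective described above, written in our own notation rather than the paper's: let $a$ be the action proposed by the utility maximizer in state $s$, $\hat{a}$ the edited action, $Q_c$ and $Q_u$ the constraint and utility critics, and $\lambda$ a weighting coefficient we introduce for illustration. Assuming the hinge penalizes only a drop in utility Q caused by the edit, the editor could minimize

$$\mathcal{L}_{\text{editor}}(s, a) \;=\; -\,Q_c\big(s, \hat{a}\big) \;+\; \lambda\,\max\!\big(0,\; Q_u(s, a) - Q_u(s, \hat{a})\big).$$

Minimizing the first term maximizes the constraint reward, while the hinge term activates only when editing lowers the utility Q value, leaving utility-neutral or utility-improving edits unpenalized (unlike an L2 distance, which would penalize all deviations equally).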