To be viable for safety-critical applications, such as autonomous driving and assistive robotics, autonomous agents should adhere to safety constraints throughout their interactions with the environment. Rather than learning about safety by collecting samples, including unsafe ones, methods such as Hamilton-Jacobi (HJ) reachability compute safe sets with theoretical guarantees, using models of the system dynamics. However, HJ reachability does not scale to high-dimensional systems, and its guarantees hinge on the quality of the model. In this work, we inject HJ reachability theory into the constrained Markov decision process (CMDP) framework as a control-theoretical approach to safety analysis via model-free updates on state-action pairs. Furthermore, we demonstrate that the HJ safety value can be learned directly on vision context, the highest-dimensional problem studied via the method to date. We evaluate our method on several benchmark tasks, including Safety Gym and Learn-to-Race (L2R), a recently released high-fidelity autonomous racing environment. Our approach incurs significantly fewer constraint violations than other constrained RL baselines, and achieves new state-of-the-art results on the L2R benchmark task.
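As a concrete illustration of the model-free update on state-action pairs referenced above, the sketch below implements one standard formulation of such an update: the discounted safety Bellman backup from HJ reachability analysis (Fisac et al., 2019), in which a state's safety value is capped by the smallest safety margin reachable along the best-controlled trajectory. This is a minimal tabular sketch under assumed conventions; the names (`safety_q_backup`, `l`, the toy problem sizes) are illustrative placeholders, not the paper's code.

```python
# Minimal sketch of a discounted safety Bellman backup (after
# Fisac et al., 2019); an assumed illustration, not the paper's API.
import numpy as np

def safety_q_backup(Q, s_next, l, gamma):
    """Return the target for the safety Q-value of a (state, action) pair.

    Q      : (num_states, num_actions) array of safety Q-values
    s_next : observed next state under the transition
    l      : safety margin l(s); l >= 0 means s satisfies the constraints
    gamma  : discount in [0, 1), which makes the backup a contraction
    """
    future = np.max(Q[s_next])  # best achievable safety from the next state
    # The value is the worst (minimum) margin along the trajectory,
    # discounted toward the current margin l(s).
    return (1.0 - gamma) * l + gamma * min(l, future)

# Illustrative usage on a toy 3-state, 2-action problem.
Q = np.zeros((3, 2))
target = safety_q_backup(Q, s_next=2, l=0.5, gamma=0.99)
Q[0, 1] += 0.1 * (target - Q[0, 1])  # standard TD step toward the target
```

Because the backup touches only sampled transitions and margins, it requires no dynamics model; the learned super-zero level set `{s : max_a Q(s, a) > 0}` then approximates the safe set that exact HJ reachability would compute.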