Reach-avoid optimal control problems, in which the system must reach certain goal conditions while staying clear of unacceptable failure modes, are central to safety and liveness assurance for autonomous robotic systems, but their exact solutions are intractable for complex dynamics and environments. Recent successes in reinforcement learning methods to approximately solve optimal control problems with performance objectives make their application to certification problems attractive; however, the Lagrange-type objective used in reinforcement learning is not suitable to encode temporal logic requirements. Recent work has shown promise in extending the reinforcement learning machinery to safety-type problems, whose objective is not a sum, but a minimum (or maximum) over time. In this work, we generalize the reinforcement learning formulation to handle all optimal control problems in the reach-avoid category. We derive a time-discounted reach-avoid Bellman backup with contraction mapping properties and prove that the resulting reach-avoid Q-learning algorithm converges under analogous conditions to the traditional Lagrange-type problem, yielding an arbitrarily tight conservative approximation to the reach-avoid set. We further demonstrate the use of this formulation with deep reinforcement learning methods, retaining zero-violation guarantees by treating the approximate solutions as untrusted oracles in a model-predictive supervisory control framework. We evaluate our proposed framework on a range of nonlinear systems, validating the results against analytic and numerical solutions, and through Monte Carlo simulation in previously intractable problems. Our results open the door to a range of learning-based methods for safe-and-live autonomous behavior, with applications across robotics and automation. See https://github.com/SafeRoboticsLab/safety_rl for code and supplementary material.
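To make the discounted reach-avoid Bellman backup mentioned above concrete, the following is a minimal tabular Q-learning sketch on a hypothetical 1-D grid world. The environment, the target and safety margins l(s) and g(s), the sign convention (sub-zero values indicating membership in the reach-avoid set), and the exact form of the backup are assumptions reconstructed for illustration only, not the authors' implementation; see the linked safety_rl repository for the actual code.

```python
import numpy as np

# Minimal tabular sketch of reach-avoid Q-learning on a hypothetical 1-D
# grid world (illustrative reconstruction, not the paper's implementation).
# Assumed sign convention: l(s) <= 0 inside the target set, g(s) <= 0
# outside the failure set, values are minimized, and the sub-zero level set
# of the learned value approximates the reach-avoid set.

N_STATES = 11          # states 0..10 on a line
TARGET = 8             # cell to reach ...
FAILURE = 2            # ... while avoiding this cell
ACTIONS = [-1, 0, 1]   # move left, stay, move right

def l_margin(s):
    # Target margin: <= 0 iff the state is on the target cell.
    return abs(s - TARGET) - 0.5

def g_margin(s):
    # Safety margin: > 0 iff the state is on the failure cell.
    return 0.5 - abs(s - FAILURE)

def step(s, a):
    # Deterministic dynamics: move and clip to the grid.
    return int(np.clip(s + a, 0, N_STATES - 1))

def reach_avoid_q_learning(gamma=0.999, alpha=0.1, episodes=5000,
                           horizon=50, eps=0.3, seed=0):
    rng = np.random.default_rng(seed)
    Q = np.zeros((N_STATES, len(ACTIONS)))
    for _ in range(episodes):
        s = rng.integers(N_STATES)
        for _ in range(horizon):
            # Epsilon-greedy over the minimizing action.
            a_idx = (rng.integers(len(ACTIONS)) if rng.random() < eps
                     else int(np.argmin(Q[s])))
            s_next = step(s, ACTIONS[a_idx])
            # Discounted reach-avoid backup (assumed form): blend the
            # immediate reach-avoid outcome with the best-case continuation,
            # gated by the safety margin so failure can never be "undone".
            target = ((1 - gamma) * max(l_margin(s), g_margin(s))
                      + gamma * max(g_margin(s),
                                    min(l_margin(s), np.min(Q[s_next]))))
            Q[s, a_idx] += alpha * (target - Q[s, a_idx])
            s = s_next
    return Q

if __name__ == "__main__":
    Q = reach_avoid_q_learning()
    V = Q.min(axis=1)
    # States with V <= 0 are (approximately) in the reach-avoid set.
    print(np.round(V, 2))
```

In this sketch, pushing gamma toward 1 is what tightens the conservative approximation, consistent with the contraction and convergence claims in the abstract; in practice the discount factor would be annealed rather than fixed.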