While reinforcement learning produces very promising results for many applications, its main disadvantage is the lack of safety guarantees, which prevents its use in safety-critical systems. In this work, we address this issue with a safety shield for nonlinear continuous systems that solve reach-avoid tasks. Our safety shield prevents potentially unsafe actions proposed by a reinforcement learning agent from being applied by projecting each proposed action to the closest safe action. This approach is called action projection and is implemented via mixed-integer optimization. The safety constraints for action projection are obtained by applying parameterized reachability analysis using polynomial zonotopes, which makes it possible to accurately capture the nonlinear effects of the actions on the system. In contrast to other state-of-the-art approaches for action projection, our safety shield can efficiently handle input constraints and dynamic obstacles, eases the incorporation of the spatial robot dimensions into the safety constraints, guarantees robust safety despite process noise and measurement errors, and is well suited for high-dimensional systems, as we demonstrate on several challenging benchmark systems.
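To illustrate the idea of action projection, the following is a minimal sketch under strong simplifying assumptions; it is not the method of this work. It assumes the set of safe actions is already given as a union of axis-aligned boxes (a hypothetical stand-in for the constraints that this work derives from reachability analysis with polynomial zonotopes). The mixed-integer optimization is replaced by plain enumeration: the discrete choice of which box to use is tried exhaustively, and the continuous projection onto each box is done by clipping. The names `project_onto_box` and `project_action` are illustrative only.

```python
import math
from typing import List, Sequence, Tuple

# A box is a pair (lower bounds, upper bounds), one entry per action dimension.
Box = Tuple[Sequence[float], Sequence[float]]


def project_onto_box(action: Sequence[float], box: Box) -> List[float]:
    """Euclidean projection onto one axis-aligned box (coordinate-wise clipping)."""
    lower, upper = box
    return [min(max(a, lo), hi) for a, lo, hi in zip(action, lower, upper)]


def project_action(action: Sequence[float], safe_boxes: Sequence[Box]) -> List[float]:
    """Return the point of the (assumed) safe set closest to the proposed action.

    Assumption: the safe set is the union of `safe_boxes`. Enumerating the boxes
    plays the role of the integer variables in a mixed-integer formulation.
    """
    if not safe_boxes:
        raise ValueError("no safe action available")
    candidates = [project_onto_box(action, box) for box in safe_boxes]
    return min(candidates, key=lambda c: math.dist(c, action))


# Hypothetical safe set: two boxes separated by an unsafe gap in the first input.
safe_boxes = [([-1.0, -1.0], [0.0, 1.0]), ([0.5, -1.0], [1.0, 1.0])]

unsafe_proposal = [0.4, 0.0]   # lies in the gap between the two boxes
safe_proposal = [0.7, 0.2]     # already inside the second box

corrected = project_action(unsafe_proposal, safe_boxes)
unchanged = project_action(safe_proposal, safe_boxes)
```

In this sketch, an action that is already safe is returned unchanged, while an unsafe one is moved to the nearest boundary of the closest box. Enumeration scales linearly with the number of boxes and ignores the system dynamics entirely, so it only conveys the geometric intuition behind the projection step.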