ISAACS: Iterative Soft Adversarial Actor-Critic for Safety

The deployment of robots in uncontrolled environments requires them to operate robustly under previously unseen scenarios, like irregular terrain and wind conditions. Unfortunately, while rigorous safety frameworks from robust optimal control theory scale poorly to high-dimensional nonlinear dynamics, control policies computed by more tractable "deep" methods lack guarantees and tend to exhibit little robustness to uncertain operating conditions. This work introduces a novel approach enabling scalable synthesis of robust safety-preserving controllers for robotic systems with general nonlinear dynamics subject to bounded modeling error by combining game-theoretic safety analysis with adversarial reinforcement learning in simulation. Following a soft actor-critic scheme, a safety-seeking fallback policy is co-trained with an adversarial "disturbance" agent that aims to invoke the worst-case realization of model error and training-to-deployment discrepancy allowed by the designer's uncertainty. While the learned control policy does not intrinsically guarantee safety, it is used to construct a real-time safety filter (or shield) with robust safety guarantees based on forward reachability rollouts. This shield can be used in conjunction with a safety-agnostic control policy, precluding any task-driven actions that could result in loss of safety. We evaluate our learning-based safety approach in a 5D race car simulator, compare the learned safety policy to the numerically obtained optimal solution, and empirically validate the robust safety guarantee of our proposed safety shield against worst-case model discrepancy.
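The rollout-based safety filter described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a toy 1D vehicle approaching a wall instead of the 5D race car, and it replaces the learned fallback policy and learned disturbance agent with hand-coded stand-ins. All names, constants, and dynamics here are hypothetical.

```python
from dataclasses import dataclass

# Hypothetical toy constants (not taken from the paper).
DT = 0.1        # integration time step
U_MAX = 2.0     # control bound, |u| <= U_MAX
D_MAX = 0.5     # disturbance bound, |d| <= D_MAX (must be < U_MAX)
WALL = 10.0     # failure set: position >= WALL
HORIZON = 50    # maximum rollout length in steps


@dataclass(frozen=True)
class State:
    p: float  # position
    v: float  # velocity (positive means moving toward the wall)


def step(s: State, u: float, d: float) -> State:
    """One Euler step of the toy dynamics with clipped control and disturbance."""
    u = max(-U_MAX, min(U_MAX, u))
    d = max(-D_MAX, min(D_MAX, d))
    return State(s.p + s.v * DT, s.v + (u + d) * DT)


def is_failure(s: State) -> bool:
    return s.p >= WALL


def fallback_policy(s: State) -> float:
    """Stand-in for the learned safety policy: always push away from the wall."""
    return -U_MAX


def adversary(s: State) -> float:
    """Stand-in for the learned disturbance agent: always push toward the wall,
    which is the worst case for these monotone toy dynamics."""
    return D_MAX


def rollout_is_safe(s: State) -> bool:
    """Simulate fallback vs. adversary; succeed once a safe invariant set is hit."""
    for _ in range(HORIZON):
        if is_failure(s):
            return False
        if s.v <= 0.0:
            # With U_MAX > D_MAX, the fallback keeps v <= 0 and p < WALL forever.
            return True
        s = step(s, fallback_policy(s), adversary(s))
    return False  # inconclusive within the horizon: treat as unsafe


def shield(s: State, task_action: float) -> float:
    """Accept the task action only if the fallback can still recover afterwards."""
    candidate = step(s, task_action, adversary(s))
    if rollout_is_safe(candidate):
        return task_action
    return fallback_policy(s)


def simulate(steps: int = 300) -> float:
    """Run a reckless task policy through the shield; return the largest position."""
    s = State(0.0, 0.0)
    max_p = s.p
    for _ in range(steps):
        u = shield(s, U_MAX)  # task policy: always accelerate toward the wall
        s = step(s, u, adversary(s))
        max_p = max(max_p, s.p)
    return max_p
```

In this sketch the shield passes the task action through while the simulated fallback can still stop the vehicle before the wall, and overrides it otherwise. The recovery check is only as trustworthy as the adversary used in the rollout. It is exact here because the worst-case disturbance of the toy system is known in closed form, whereas the paper trains the disturbance agent to approximate it.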
