Deep Reinforcement Learning (DRL) policies have been shown to be vulnerable to small adversarial noise in observations. Such adversarial noise can have disastrous consequences in safety-critical environments. For instance, a self-driving car receiving adversarially perturbed sensory observations about nearby signs (e.g., a stop sign physically altered to be perceived as a speed limit sign) or objects (e.g., cars altered to be recognized as trees) can be fatal. Existing approaches for making RL algorithms robust to an observation-perturbing adversary have focused on reactive methods that iteratively improve the policy against adversarial examples generated at each training iteration. While such approaches have been shown to improve over regular RL methods, they are reactive and can fare significantly worse if certain categories of adversarial examples are not generated during training. To that end, we pursue a more proactive approach that directly optimizes a well-studied robustness measure, regret, instead of expected value. We provide a principled approach that minimizes the maximum regret over a "neighborhood" of the received observation. Our regret criterion can be used to modify existing value-based and policy-based Deep RL methods. We demonstrate that our approaches provide a significant improvement in performance across a wide variety of benchmarks against leading approaches for robust Deep RL.
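As a minimal illustrative sketch (with assumed notation, not necessarily the exact formulation used here), the regret criterion over a perturbation neighborhood can be written as follows, where o is the received observation, B_epsilon(o) is an assumed set of candidate observations within perturbation budget epsilon of o, and V^pi(o-hat) denotes the value attained by policy pi when the underlying observation is o-hat:

% Illustrative max-regret objective (assumed notation): the policy minimizes, over the
% worst-case observation \hat{o} in the neighborhood B_\epsilon(o), the gap between the
% best value achievable at \hat{o} and the value the policy itself attains there.
\[
  \pi^{*} \;\in\; \arg\min_{\pi} \; \max_{\hat{o} \,\in\, B_{\epsilon}(o)}
  \Big[ \, \max_{\pi'} V^{\pi'}(\hat{o}) \; - \; V^{\pi}(\hat{o}) \, \Big]
\]

Minimizing this worst-case gap, rather than the expected value under a single assumed observation, is what makes the criterion proactive: it accounts for every perturbation in the neighborhood rather than only those adversarial examples generated by a particular attack during training.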