Evaluating the worst-case performance of a reinforcement learning (RL) agent under the strongest/optimal adversarial perturbations on state observations (within some constraints) is crucial for understanding the robustness of RL agents. However, finding the optimal adversary is challenging, both in whether the optimal attack can be found and in how efficiently it can be found. Existing works on adversarial RL either use heuristic-based methods that may not find the strongest adversary, or directly train an RL-based adversary by treating the agent as part of the environment, which can find the optimal adversary but may become intractable in a large state space. This paper introduces a novel attack method that finds the optimal attacks through collaboration between a designed function named "actor" and an RL-based learner named "director". The actor crafts state perturbations for a given policy perturbation direction, and the director learns to propose the best policy perturbation directions. Our proposed algorithm, PA-AD, is theoretically optimal and significantly more efficient than prior RL-based works in environments with large state spaces. Empirical results show that PA-AD universally outperforms state-of-the-art attack methods in various Atari and MuJoCo environments. By applying PA-AD to adversarial training, we achieve state-of-the-art empirical robustness in multiple tasks under strong adversaries.
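To make the actor-director division of labor concrete, the following is a minimal, hedged sketch of how such an attack loop could be wired up; it is not the authors' implementation. The names actor_perturb, director_propose, victim_policy, and epsilon, the finite-difference search, and the random direction proposer are all illustrative assumptions: in PA-AD the director is an RL learner whose action space is the space of policy perturbation directions, not a random stub.

```python
# Conceptual sketch only (assumed names/structure), not the PA-AD implementation.
import numpy as np

def actor_perturb(state, direction, victim_policy, epsilon, steps=10, lr=0.01):
    """Search for a state perturbation inside an L-infinity ball of radius epsilon
    that pushes the victim's policy output along the given perturbation direction.
    Uses a crude finite-difference gradient estimate as a stand-in for the actor."""
    perturbed = state.astype(float).copy()
    for _ in range(steps):
        grad = np.zeros_like(perturbed)
        base = victim_policy(perturbed)
        for i in range(perturbed.size):
            probe = perturbed.copy()
            probe.flat[i] += 1e-3
            # How much does moving this state dimension push the policy output
            # along the director's proposed direction?
            grad.flat[i] = np.dot(victim_policy(probe) - base, direction) / 1e-3
        perturbed = perturbed + lr * np.sign(grad)
        perturbed = np.clip(perturbed, state - epsilon, state + epsilon)
    return perturbed

def director_propose(action_dim, rng):
    """Stub for the director: propose a unit policy-perturbation direction.
    In the actual method this proposal is learned with RL, not sampled randomly."""
    direction = rng.standard_normal(action_dim)
    return direction / (np.linalg.norm(direction) + 1e-8)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = rng.standard_normal((3, 4))           # toy linear "victim policy"
    victim_policy = lambda s: W @ s           # maps state (4,) to action logits (3,)
    state = rng.standard_normal(4)
    direction = director_propose(action_dim=3, rng=rng)
    adv_state = actor_perturb(state, direction, victim_policy, epsilon=0.1)
    print("original logits: ", victim_policy(state))
    print("perturbed logits:", victim_policy(adv_state))
```

The design point the sketch illustrates is the factorization described above: the director works in the (small) policy perturbation space, while the actor handles the (possibly very large) state space, which is what makes the combined attack tractable compared with training an end-to-end RL adversary over raw observations.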