Adversarial examples against AI systems pose risks, in the form of malicious attacks, as well as opportunities to improve robustness through adversarial training. In multiagent settings, adversarial policies can be developed by training an adversarial agent to minimize a victim agent's rewards. Prior work has studied black-box attacks in which the adversary only sees state observations and effectively treats the victim as any other part of the environment. In this work, we experiment with white-box adversarial policies to study whether an agent's internal state can offer useful information to other agents. We make three contributions. First, we introduce white-box adversarial policies in which an attacker can observe a victim's internal state at each timestep. Second, we demonstrate that white-box access to a victim enables stronger attacks in two-agent environments, yielding both faster initial learning and higher asymptotic performance against the victim. Third, we show that training against white-box adversarial policies can make learners in single-agent environments more robust to domain shifts.
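To make the black-box/white-box distinction concrete, the following Python sketch shows one way a white-box adversary's observation could be assembled: the adversary's environment observation is concatenated with the victim's per-timestep internal activations. `DummyVictim` and its `hidden_activations` accessor are hypothetical stand-ins for illustration, not the paper's actual interface; the adversary would then be trained with a standard RL algorithm to minimize the victim's return.

```python
import numpy as np

class DummyVictim:
    """Stand-in for a frozen victim policy whose internals the attacker reads."""

    def __init__(self, obs_dim: int, hidden_dim: int, rng: np.random.Generator):
        self.w = rng.normal(size=(obs_dim, hidden_dim))
        self._hidden = np.zeros(hidden_dim)

    def act(self, obs: np.ndarray) -> int:
        # Internal state is updated every timestep the victim acts.
        self._hidden = np.tanh(obs @ self.w)
        return int(self._hidden.sum() > 0)

    def hidden_activations(self) -> np.ndarray:
        return self._hidden


def adversary_observation(env_obs: np.ndarray, victim: DummyVictim,
                          white_box: bool = True) -> np.ndarray:
    """Black-box attackers see only `env_obs`; white-box attackers also see
    the victim's internal activations at the current timestep."""
    if not white_box:
        return env_obs.ravel()
    return np.concatenate([env_obs.ravel(), victim.hidden_activations().ravel()])


rng = np.random.default_rng(0)
victim = DummyVictim(obs_dim=4, hidden_dim=8, rng=rng)
obs = rng.normal(size=4)
victim.act(obs)                                   # victim steps, updating its internals
print(adversary_observation(obs, victim).shape)   # (12,) = 4 env dims + 8 internal dims
```

In a zero-sum two-agent setting, minimizing the victim's return is equivalent to training the adversary on the negated victim reward, so any standard policy-gradient method can be applied to the augmented observation without modification.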