Deep reinforcement learning (RL) policies are known to be vulnerable to adversarial perturbations to their observations, similar to adversarial examples for classifiers. However, an attacker is not usually able to directly modify another agent's observations. This might lead one to wonder: is it possible to attack an RL agent simply by choosing an adversarial policy acting in a multi-agent environment so as to create natural observations that are adversarial? We demonstrate the existence of adversarial policies in zero-sum games between simulated humanoid robots with proprioceptive observations, against state-of-the-art victims trained via self-play to be robust to opponents. The adversarial policies reliably win against the victims but generate seemingly random and uncoordinated behavior. We find that these policies are more successful in high-dimensional environments, and induce substantially different activations in the victim policy network than when the victim plays against a normal opponent. Videos are available at https://adversarialpolicies.github.io/.
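For context, the attack described above reduces to ordinary single-agent RL: the victim's policy is frozen and embedded inside the environment, and the adversary is then trained against the combined system. The sketch below illustrates that reduction; `TwoPlayerEnv`, `victim_policy`, and all other names are hypothetical placeholders, not the paper's code, and the example does not reproduce the MuJoCo setup used in the experiments.

```python
class VictimEmbeddedEnv:
    """Wraps a two-player zero-sum environment so that a frozen victim policy
    becomes part of the environment dynamics. From the adversary's point of
    view, the result is a standard single-agent environment (illustrative
    sketch only; interface names are assumptions)."""

    def __init__(self, two_player_env, victim_policy):
        self.env = two_player_env          # hypothetical two-player environment
        self.victim_policy = victim_policy # frozen: never updated during the attack

    def reset(self):
        obs_adversary, self._obs_victim = self.env.reset()
        return obs_adversary

    def step(self, adversary_action):
        # The victim acts from its own (proprioceptive) observation;
        # its parameters stay fixed throughout training of the adversary.
        victim_action = self.victim_policy(self._obs_victim)
        (obs_adversary, self._obs_victim), rewards, done, info = self.env.step(
            (adversary_action, victim_action)
        )
        # Zero-sum game: the adversary is rewarded when the victim loses.
        return obs_adversary, rewards[0], done, info
```

Any standard policy-gradient algorithm (the paper trains the adversary with PPO) can then be run on such a wrapper without modification, since the frozen victim simply appears to the adversary as part of the stochastic transition dynamics.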