Recent works demonstrate that deep reinforcement learning (DRL) models are vulnerable to adversarial attacks that can decrease the victim's total reward by manipulating its observations. Compared with adversarial attacks in supervised learning, deceiving a DRL model is much more challenging since the adversary has to infer the environmental dynamics. To address this issue, we reformulate the problem of adversarial attacks in function space and separate the previous gradient-based attacks into several subspaces. Following the analysis of the function space, we design a generic two-stage framework in the subspace where the adversary lures the agent toward a target trajectory or a deceptive policy. In the first stage, we train a deceptive policy by hacking the environment and discover a set of trajectories leading to the lowest reward. In the second stage, the adversary misleads the victim into imitating the deceptive policy by perturbing its observations. Our method provides a tighter theoretical upper bound on the attacked agent's performance than existing approaches. Extensive experiments demonstrate the superiority of our method, achieving state-of-the-art performance on both Atari and MuJoCo environments.
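To make the two-stage idea concrete, below is a minimal sketch of the second stage under simplifying assumptions: the victim has a discrete action space, `victim_policy` and `deceptive_policy` are hypothetical networks returning action logits, and the deceptive policy from the first stage has already been trained (e.g., on the negated environment reward). The perturbation is a standard PGD-style step that nudges the victim's action distribution toward the deceptive policy's action; this is an illustration, not the paper's exact attack.

```python
import torch
import torch.nn.functional as F

def stage2_perturb(obs, victim_policy, deceptive_policy,
                   epsilon=0.05, step_size=0.01, n_steps=10):
    """Craft a bounded observation perturbation that lures the victim
    toward the deceptive policy's action (PGD on the cross-entropy loss).
    All network names and hyperparameters here are illustrative."""
    obs = obs.detach()
    # Target action chosen by the pre-trained deceptive policy on the clean observation.
    with torch.no_grad():
        target_action = deceptive_policy(obs).argmax(dim=-1)

    delta = torch.zeros_like(obs, requires_grad=True)
    for _ in range(n_steps):
        logits = victim_policy(obs + delta)
        # Encourage the victim to select the deceptive policy's action.
        loss = F.cross_entropy(logits, target_action)
        loss.backward()
        with torch.no_grad():
            delta -= step_size * delta.grad.sign()   # descend the imitation loss
            delta.clamp_(-epsilon, epsilon)          # stay inside the L_inf ball
        delta.grad.zero_()
    return (obs + delta).detach()
```

In a full attack loop, the victim would be fed `stage2_perturb(obs, ...)` in place of the clean observation at every time step, so that its rollout gradually follows the low-reward trajectories discovered in the first stage.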