Adversarial attacks in reinforcement learning (RL) often assume highly privileged access to the victim's parameters, environment, or data. Instead, this paper proposes a novel adversarial setting called a Cheap Talk MDP, in which an Adversary can merely append deterministic messages to the Victim's observation, resulting in a minimal range of influence. The Adversary cannot occlude ground truth, influence the underlying environment dynamics or reward signals, introduce non-stationarity, add stochasticity, see the Victim's actions, or access their parameters. Additionally, we present a simple meta-learning algorithm called Adversarial Cheap Talk (ACT) to train Adversaries in this setting. We demonstrate that, despite the highly constrained setting, an Adversary trained with ACT can still significantly influence the Victim's training and testing performance. Affecting train-time performance reveals a new attack vector and provides insight into the success and failure modes of existing RL algorithms. More specifically, we show that an ACT Adversary is capable of harming performance by interfering with the learner's function approximation, or instead helping the Victim's performance by outputting useful features. Finally, we show that an ACT Adversary can manipulate messages during train-time to directly and arbitrarily control the Victim at test-time.
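To make the Cheap Talk constraint concrete, the following is a minimal sketch of how such an Adversary sits alongside an environment, assuming a Gymnasium-style reset/step interface; `CheapTalkWrapper` and `adversary_fn` are illustrative names introduced here, not identifiers from the paper.

```python
import numpy as np

class CheapTalkWrapper:
    """Illustrative sketch of a Cheap Talk MDP: the Adversary's only channel
    of influence is a deterministic message appended to the Victim's
    observation. Names here are hypothetical, not from the paper."""

    def __init__(self, env, adversary_fn):
        self.env = env                    # unmodified underlying environment
        self.adversary_fn = adversary_fn  # deterministic map: observation -> message

    def _augment(self, obs):
        # The ground-truth observation is never occluded; the message is
        # purely appended, so dynamics and rewards are untouched.
        message = self.adversary_fn(obs)
        return np.concatenate([obs, message])

    def reset(self):
        obs, info = self.env.reset()
        return self._augment(obs), info

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        # Reward and termination pass through unchanged; the Adversary
        # never sees the Victim's action or parameters.
        return self._augment(obs), reward, terminated, truncated, info
```

Note that the wrapper only concatenates: it never modifies `obs`, `reward`, or the termination signals, mirroring the constraints listed in the abstract.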
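The abstract leaves ACT's outer optimizer unspecified. Purely as a sketch of what meta-learning an Adversary in this setting could look like, the snippet below runs a full inner-loop Victim training per candidate and hill-climbs on an outer objective; every name (`train_victim`, `evaluate_victim`, `perturb`) is a hypothetical placeholder, and random-search hill climbing stands in for whatever optimizer ACT actually uses.

```python
import numpy as np

def act_meta_training_sketch(train_victim, evaluate_victim, adversary_init,
                             perturb, meta_steps=100):
    """Heavily simplified meta-learning loop in the spirit of ACT.

    train_victim(params)    -> trained Victim (inner loop: the Victim learns
                               from scratch in the cheap-talk-augmented env).
    evaluate_victim(victim) -> scalar test-time return of the trained Victim.
    perturb(params, rng)    -> candidate Adversary parameters.
    All of these are hypothetical placeholders, not the paper's API.
    """
    rng = np.random.default_rng(0)
    params, best_score = adversary_init, -np.inf
    for _ in range(meta_steps):
        candidate = perturb(params, rng)
        victim = train_victim(candidate)   # full inner training run
        # Outer objective: negate the Victim's return for a harmful
        # Adversary; use the return directly for a helpful one.
        score = -evaluate_victim(victim)
        if score > best_score:             # simple hill climbing
            best_score, params = score, candidate
    return params
```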