A backdoor attack allows a malicious user to manipulate the environment or corrupt the training data, thereby inserting a backdoor into the trained agent. Such attacks compromise the reliability of RL systems and can lead to catastrophic results in safety-critical domains. By contrast, relatively little research has investigated effective defenses against backdoor attacks in RL. This paper proposes Recovery Triggered States (RTS), a novel method that protects victim agents from backdoor attacks. RTS builds a surrogate network to approximate the dynamics model; developers can then recover the environment from a triggered state to a clean state, preventing attackers from activating backdoors hidden in the agent by presenting the trigger. When training the surrogate to predict states, we incorporate the agent's action information to reduce the discrepancy between the actions the agent takes on predicted states and those it takes on real states. RTS is the first approach to defend against backdoor attacks in a single-agent setting. Our results show that with RTS, the cumulative reward decreases by only 1.41% under backdoor attack.
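The core idea above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the surrogate network and policy are stand-in linear maps, and `rts_loss` shows one plausible form of the training objective, combining a state-prediction error with the action-consistency term described in the abstract (all names and the weighting `lam` are assumptions).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical fixed (pretrained) victim policy: a simple linear map for illustration.
W_pi = rng.standard_normal((2, 4))

def policy(state):
    """Action the agent takes on a given state (stand-in for the victim policy)."""
    return np.tanh(W_pi @ state)

# Surrogate dynamics network, sketched as a single linear layer: it maps a
# possibly triggered observation to a predicted clean state.
W_sur = np.eye(4) + 0.1 * rng.standard_normal((4, 4))

def recover(state):
    """Predict the clean state underlying a (possibly triggered) observation."""
    return W_sur @ state

def rts_loss(triggered_state, clean_state, lam=0.5):
    """Assumed training objective: state-prediction error plus an
    action-consistency term, so that the action taken on the recovered
    state matches the action taken on the true clean state."""
    pred = recover(triggered_state)
    state_err = np.mean((pred - clean_state) ** 2)
    action_err = np.mean((policy(pred) - policy(clean_state)) ** 2)
    return state_err + lam * action_err
```

At deployment, every observation would be passed through `recover` before the policy sees it, so a trigger pattern embedded in the observation is (approximately) stripped away and the hidden backdoor is never activated.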