Recover Triggered States: Protect Model Against Backdoor Attack in Reinforcement Learning

Hao Chen, Chen Gong, Yizhe Wang, Xinwen Hou

A backdoor attack allows a malicious user to manipulate the environment or corrupt the training data, thereby inserting a backdoor into the trained agent. Such attacks compromise the reliability of reinforcement learning (RL) systems and can lead to catastrophic results in key application fields. Despite this threat, relatively little research has investigated effective defenses against backdoor attacks in RL. This paper proposes the Recovery Triggered States (RTS) method, a novel approach that effectively protects victim agents from backdoor attacks. RTS builds a surrogate network to approximate the dynamics model. Developers can then recover the environment from a triggered state to a clean state, which prevents attackers from activating backdoors hidden in the agent by presenting the trigger. When training the surrogate to predict states, we incorporate agent action information to reduce the discrepancy between the actions the agent takes on predicted states and the actions it takes on real states. RTS is the first approach to defend against backdoor attacks in a single-agent setting. Our results show that, with RTS, the cumulative reward decreases by only 1.41% under the backdoor attack.
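The two ideas in the abstract, an action-aware surrogate loss and state recovery, can be illustrated with a toy sketch. This is not the authors' implementation. The abstract does not give the loss form or say when recovery is applied, so the weighted sum in `surrogate_loss`, the `action_weight` parameter, and the choice to always substitute the surrogate's prediction are assumptions. The toy policy, the toy surrogate, and every name below are hypothetical.

```python
from typing import Callable, List

State = List[float]
Policy = Callable[[State], List[float]]      # state -> action scores
Surrogate = Callable[[State, int], State]    # (state, action) -> predicted next state


def squared_error(a: List[float], b: List[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def surrogate_loss(predicted: State, real: State, policy: Policy,
                   action_weight: float = 1.0) -> float:
    # Assumed form: state-prediction error plus a weighted penalty on the
    # gap between the agent's outputs on predicted and real states.
    state_term = squared_error(predicted, real)
    action_term = squared_error(policy(predicted), policy(real))
    return state_term + action_weight * action_term


def recover_state(surrogate: Surrogate, prev_state: State, prev_action: int) -> State:
    # Predict the current state from the previous clean state and action
    # instead of trusting the (possibly triggered) observation.
    return surrogate(prev_state, prev_action)


def toy_policy(state: State) -> List[float]:
    # Hypothetical backdoored agent: feature 2 is the trigger channel.
    if state[2] > 0.5:
        return [0.0, 1.0]    # backdoor fires, prefers malicious action 1
    return [1.0, 0.0]        # normal behavior, prefers benign action 0


def toy_surrogate(state: State, action: int) -> State:
    # Stand-in for a trained network: clean dynamics never set the trigger.
    return [state[0] + 1.0, state[1], 0.0]


def best_action(scores: List[float]) -> int:
    return max(range(len(scores)), key=scores.__getitem__)


prev_state, prev_action = [0.0, 0.5, 0.0], 0
triggered_obs = [1.0, 0.5, 1.0]    # attacker injects the trigger
recovered = recover_state(toy_surrogate, prev_state, prev_action)

action_on_triggered = best_action(toy_policy(triggered_obs))
action_on_recovered = best_action(toy_policy(recovered))
```

In this toy setup the agent picks the malicious action on the triggered observation and the benign action on the recovered state. The action term in the loss is what penalizes a surrogate whose predictions are close in state space yet still change the agent's behavior.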
