Deep neural networks (DNNs) are known to be vulnerable to backdoor attacks: once a backdoor trigger is planted at training time, the infected DNN model will misclassify any test sample embedded with the trigger as the target label. Due to the stealthiness of backdoor attacks, it is hard to either detect the backdoor or erase it from infected models. In this paper, we propose a new Adversarial Fine-Tuning (AFT) approach that erases backdoor triggers by leveraging adversarial examples of the infected model. For an infected model, we observe that its adversarial examples behave similarly to its triggered samples. Based on this observation, we design AFT to break the foundation of the backdoor attack, i.e., the strong correlation between a trigger and a target label. We empirically show that, against 5 state-of-the-art backdoor attacks, AFT can effectively erase the backdoor triggers without obvious performance degradation on clean samples, significantly outperforming existing defense methods.
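As an illustration of the idea described above, the following is a minimal sketch of adversarial fine-tuning: generate adversarial examples of the (possibly backdoored) model on clean data, then fine-tune the model on those examples with their ground-truth labels, which weakens the trigger-to-target-label correlation. This is an assumption-laden sketch, not the paper's exact procedure; the attack (here, untargeted PGD), the hyperparameters, and the function names `pgd_attack` and `adversarial_fine_tune` are all illustrative.

```python
# Hedged sketch of adversarial fine-tuning (AFT). Assumes inputs are images
# scaled to [0, 1] and that PGD is used to craft the adversarial examples;
# the paper's actual attack settings and hyperparameters may differ.
import torch
import torch.nn.functional as F


def pgd_attack(model, x, y, eps=8 / 255, alpha=2 / 255, steps=10):
    """Untargeted PGD attack within an L-infinity ball of radius eps."""
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1).detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        # Ascend the loss, then project back into the epsilon ball.
        x_adv = x_adv.detach() + alpha * grad.sign()
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0, 1)
    return x_adv.detach()


def adversarial_fine_tune(model, clean_loader, epochs=10, lr=1e-3):
    """Fine-tune the infected model on its own adversarial examples,
    labeled with the ground-truth classes of the clean samples."""
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    for _ in range(epochs):
        for x, y in clean_loader:
            model.eval()
            x_adv = pgd_attack(model, x, y)  # adversarial examples of the infected model
            model.train()
            optimizer.zero_grad()
            # Training the model to predict the true label on adversarial
            # inputs breaks the trigger-target correlation it learned.
            loss = F.cross_entropy(model(x_adv), y)
            loss.backward()
            optimizer.step()
    return model
```

In this sketch, only a small set of clean samples and a few fine-tuning epochs are assumed to be available to the defender, consistent with the standard post-training defense setting.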