Deep neural networks (DNNs) are known to be vulnerable to both backdoor attacks and adversarial attacks. In the literature, these two types of attacks are commonly treated as distinct problems and solved separately, since they belong to training-time and inference-time attacks, respectively. However, in this paper we find an intriguing connection between them: for a model planted with backdoors, we observe that its adversarial examples behave similarly to its triggered samples, i.e., both activate the same subset of DNN neurons. This indicates that planting a backdoor into a model significantly affects the model's adversarial examples. Based on this observation, we design a new Adversarial Fine-Tuning (AFT) algorithm to defend against backdoor attacks. We empirically show that, against 5 state-of-the-art backdoor attacks, AFT can effectively erase the backdoor triggers without obvious performance degradation on clean samples, and significantly outperforms existing defense methods.
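The following is a minimal sketch of the adversarial fine-tuning idea described above, not the authors' exact implementation: it assumes a standard PGD attack for generating adversarial examples and cross-entropy fine-tuning of the (possibly backdoored) model on correctly labeled adversarial samples. The model, data loader, and hyperparameters (epsilon, alpha, steps, lr) are illustrative placeholders.

```python
# Hedged sketch of adversarial fine-tuning (AFT); hyperparameters are assumptions.
import torch
import torch.nn.functional as F


def pgd_attack(model, x, y, epsilon=8 / 255, alpha=2 / 255, steps=10):
    """Generate L-infinity PGD adversarial examples for a batch (x, y)."""
    x_adv = (x.clone().detach()
             + torch.empty_like(x).uniform_(-epsilon, epsilon)).clamp(0, 1)
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() + alpha * grad.sign()
        # Project back into the epsilon-ball around the clean input.
        x_adv = (x.detach() + (x_adv - x).clamp(-epsilon, epsilon)).clamp(0, 1)
    return x_adv.detach()


def adversarial_fine_tune(model, clean_loader, epochs=10, lr=1e-3):
    """Fine-tune a (possibly backdoored) model on adversarial examples of clean data.

    Intuition from the observation above: adversarial examples of a backdoored
    model behave like triggered samples, so fitting them to their correct
    labels suppresses the trigger-related behavior.
    """
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    for _ in range(epochs):
        for x, y in clean_loader:
            model.eval()                      # fixed stats while crafting attacks
            x_adv = pgd_attack(model, x, y)
            model.train()
            optimizer.zero_grad()
            loss = F.cross_entropy(model(x_adv), y)
            loss.backward()
            optimizer.step()
    return model
```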