Deep neural networks (DNNs) are known to be vulnerable to backdoor attacks, a training-time attack that injects a trigger pattern into a small proportion of the training data so as to control the model's predictions at test time. Backdoor attacks are notably dangerous because they do not affect the model's performance on clean examples, yet can fool the model into making incorrect predictions whenever the trigger pattern appears during testing. In this paper, we propose a novel defense framework, Neural Attention Distillation (NAD), to erase backdoor triggers from backdoored DNNs. NAD uses a teacher network to guide the fine-tuning of the backdoored student network on a small clean subset of data, such that the intermediate-layer attention of the student network aligns with that of the teacher network. The teacher network can be obtained by an independent fine-tuning process on the same clean subset. We empirically show that, against 6 state-of-the-art backdoor attacks, NAD can effectively erase the backdoor triggers using only 5\% of the clean training data, without causing obvious performance degradation on clean examples.
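As a rough illustration of the attention-alignment idea (a minimal sketch, not the authors' released implementation), the distillation term could be written in PyTorch as below; the names \texttt{attention\_map} and \texttt{nad\_loss}, and the per-layer weights \texttt{betas}, are illustrative assumptions.

\begin{verbatim}
import torch
import torch.nn.functional as F

def attention_map(feature, p=2):
    # Collapse a feature map (B, C, H, W) into a spatial attention map by
    # averaging |activation|^p over the channel dimension, then L2-normalize
    # each flattened map so teacher and student maps are scale-comparable.
    am = feature.abs().pow(p).mean(dim=1).flatten(1)
    return F.normalize(am, p=2, dim=1)

def nad_loss(student_feats, teacher_feats, betas):
    # Weighted sum of per-layer L2 distances between the student's and the
    # teacher's attention maps; `betas` are hypothetical per-layer weights.
    loss = 0.0
    for fs, ft, beta in zip(student_feats, teacher_feats, betas):
        loss = loss + beta * (attention_map(fs) - attention_map(ft)).pow(2).mean()
    return loss
\end{verbatim}

In such a sketch, the total fine-tuning objective on the small clean subset would combine the standard cross-entropy loss with this distillation term, so the student both preserves clean accuracy and is pulled toward the teacher's attention patterns.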