Deep neural networks (DNNs) are known to be vulnerable to backdoor attacks, a training-time attack that injects a trigger pattern into a small proportion of the training data so as to control the model's predictions at test time. Backdoor attacks are notably dangerous since they do not affect the model's performance on clean examples, yet can fool the model into making incorrect predictions whenever the trigger pattern appears during testing. In this paper, we propose a novel defense framework, Neural Attention Distillation (NAD), to erase backdoor triggers from backdoored DNNs. NAD utilizes a teacher network to guide the finetuning of the backdoored student network on a small clean subset of data such that the intermediate-layer attention of the student network aligns with that of the teacher network. The teacher network can be obtained by an independent finetuning process on the same clean subset. We empirically show that, against 6 state-of-the-art backdoor attacks, NAD can effectively erase the backdoor triggers using only 5\% clean training data without causing obvious performance degradation on clean examples. Code is available at https://github.com/bboylyg/NAD.
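To make the attention-alignment objective concrete, below is a minimal PyTorch-style sketch of the kind of loss the abstract describes: a clean cross-entropy term plus a term pulling the student's intermediate-layer attention maps toward the teacher's. The function names (`attention_map`, `nad_loss`), the choice of attention operator, and the weighting `beta` are illustrative assumptions, not the authors' released implementation (see https://github.com/bboylyg/NAD).

```python
import torch
import torch.nn.functional as F

def attention_map(feature):
    """Collapse a conv feature map (B, C, H, W) into a normalized spatial
    attention map (B, H*W): average squared activations over channels,
    then L2-normalize per example. (Illustrative choice of operator.)"""
    a = feature.pow(2).mean(dim=1)   # (B, H, W)
    a = a.flatten(1)                 # (B, H*W)
    return F.normalize(a, p=2, dim=1)

def nad_loss(student_feats, teacher_feats, logits, labels, beta=1000.0):
    """Clean cross-entropy plus attention alignment between corresponding
    intermediate layers of the (backdoored) student and the finetuned teacher.
    `beta` is a hypothetical distillation weight."""
    ce = F.cross_entropy(logits, labels)
    distill = sum(
        (attention_map(fs) - attention_map(ft)).pow(2).sum(dim=1).mean()
        for fs, ft in zip(student_feats, teacher_feats)
    )
    return ce + beta * distill

# Example usage with dummy tensors standing in for hooked layer outputs:
B = 4
student_feats = [torch.randn(B, 64, 16, 16)]
teacher_feats = [torch.randn(B, 64, 16, 16)]
logits, labels = torch.randn(B, 10), torch.randint(0, 10, (B,))
loss = nad_loss(student_feats, teacher_feats, logits, labels)
```

In practice the student would be finetuned on the small clean subset by minimizing this loss with the teacher's parameters frozen, so only the attention statistics of the clean data, not the trigger pattern, survive finetuning.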