As deep neural networks (DNNs) grow larger, their demand for computational resources becomes enormous, which makes outsourcing training increasingly popular. Training on a third-party platform, however, introduces the risk that a malicious trainer returns a backdoored DNN, which behaves normally on clean samples but outputs targeted misclassifications whenever a trigger appears at test time. Without any knowledge of the trigger, it is difficult to distinguish backdoored DNNs from benign ones, or to recover benign behavior from them. In this paper, we first identify an unexpected sensitivity of backdoored DNNs: when their neurons are adversarially perturbed, they are much more prone to collapse and tend to predict the target label even on clean samples. Based on these observations, we propose a novel model-repairing method, termed Adversarial Neuron Pruning (ANP), which prunes the most sensitive neurons to purify the injected backdoor. Experiments show that, even with only an extremely small amount of clean data (e.g., 1%), ANP effectively removes the injected backdoor without causing obvious performance degradation.
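The abstract describes ANP only at a high level. Below is a minimal, illustrative sketch of the underlying idea (learn a per-neuron mask under worst-case neuron perturbation, then prune the neurons whose mask collapses); it is not the authors' implementation. The MaskedMLP architecture, applying the mask and perturbation to hidden activations rather than to weights, and all hyperparameters (eps, alpha, learning rates, pruning threshold) are assumptions made for this example.

```python
# Illustrative sketch of adversarial neuron pruning (ANP); details are assumed.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MaskedMLP(nn.Module):
    """Tiny classifier whose hidden neurons can be masked and perturbed."""

    def __init__(self, in_dim=784, hidden=256, n_classes=10):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, hidden)
        self.fc2 = nn.Linear(hidden, n_classes)
        # One continuous mask value per hidden neuron (1 = keep, 0 = prune).
        self.mask = nn.Parameter(torch.ones(hidden))
        # Adversarial multiplicative perturbation applied to each neuron.
        self.delta = nn.Parameter(torch.zeros(hidden))

    def forward(self, x):
        h = F.relu(self.fc1(x))
        # Each neuron's output is scaled by its (perturbed) mask value.
        h = h * (self.mask + self.delta)
        return self.fc2(h)


def anp_step(model, x, y, eps=0.4, alpha=0.2, inner_lr=0.2, outer_lr=0.1):
    """One outer step: find a worst-case neuron perturbation, then update the mask."""
    # Inner maximization: perturb neurons to maximize the clean-data loss,
    # exploiting the sensitivity of backdoored models to such perturbations.
    model.delta.data.zero_()
    loss = F.cross_entropy(model(x), y)
    grad = torch.autograd.grad(loss, model.delta)[0]
    model.delta.data = (model.delta.data + inner_lr * grad.sign()).clamp(-eps, eps)

    # Outer minimization: update the mask so the model stays accurate both
    # under the worst-case perturbation and without it.
    loss_adv = F.cross_entropy(model(x), y)
    model.delta.data.zero_()
    loss_clean = F.cross_entropy(model(x), y)
    total = alpha * loss_adv + (1 - alpha) * loss_clean
    grad_m = torch.autograd.grad(total, model.mask)[0]
    model.mask.data = (model.mask.data - outer_lr * grad_m).clamp(0.0, 1.0)
    return total.item()


def prune(model, threshold=0.2):
    """Hard-prune neurons whose learned mask fell below the threshold."""
    with torch.no_grad():
        model.mask.data = (model.mask.data > threshold).float() * model.mask.data


if __name__ == "__main__":
    model = MaskedMLP()
    # Stand-in for the small clean dataset (e.g., 1% of the training data).
    x, y = torch.randn(64, 784), torch.randint(0, 10, (64,))
    for _ in range(10):
        anp_step(model, x, y)
    prune(model)
```

Under these assumptions, the neurons whose masks must be driven toward zero to keep the model accurate under worst-case perturbation correspond to the "sensitive" neurons mentioned in the abstract, and hard-pruning them is intended to remove the backdoor while preserving clean accuracy.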