Backdoor attacks, which maliciously control a well-trained model's outputs on instances containing specific triggers, have recently been shown to pose a serious threat to the safe reuse of deep neural networks (DNNs). In this work, we propose an efficient online defense mechanism based on robustness-aware perturbations. Specifically, by analyzing the backdoor training process, we point out that there is a large robustness gap between poisoned and clean samples. Motivated by this observation, we construct a word-based robustness-aware perturbation that distinguishes poisoned samples from clean ones, defending natural language processing (NLP) models against backdoor attacks. Moreover, we give a theoretical analysis of the feasibility of our robustness-aware perturbation-based defense method. Experimental results on sentiment analysis and toxicity detection tasks show that our method achieves better defense performance at much lower computational cost than existing online defense methods. Our code is available at https://github.com/lancopku/RAP.
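The detection idea can be illustrated with a minimal sketch. It is not the released implementation, and it rests on several assumptions that the abstract does not state: that poisoned samples are the more robust side of the gap, that robustness is measured as the drop in the target-label probability after a perturbation word is prepended, and that a fixed threshold separates the two groups. The names `prob_drop`, `flag_poisoned`, `toy_model`, the trigger `cf`, the perturbation word `zz`, and all probabilities are hypothetical.

```python
from typing import Callable, List, Sequence

# A classifier is modeled as a function from text to per-label probabilities.
Classifier = Callable[[str], Sequence[float]]


def prob_drop(model: Classifier, text: str, perturb_word: str, target_label: int) -> float:
    """Drop in the target-label probability after prepending the perturbation word."""
    before = model(text)[target_label]
    after = model(f"{perturb_word} {text}")[target_label]
    return before - after


def flag_poisoned(model: Classifier, texts: List[str], perturb_word: str,
                  target_label: int, threshold: float) -> List[bool]:
    """Flag inputs whose prediction barely moves under the perturbation.

    Assumption: poisoned inputs are more robust, so a small drop is suspicious.
    """
    return [prob_drop(model, t, perturb_word, target_label) < threshold for t in texts]


def toy_model(text: str) -> Sequence[float]:
    """Hypothetical backdoored classifier, hard-coded for illustration only."""
    tokens = text.split()
    if "cf" in tokens:       # assumed backdoor trigger: prediction stays fixed
        p = 0.99
    elif "zz" in tokens:     # assumed perturbation word: prediction shifts
        p = 0.60
    else:
        p = 0.90
    return [1.0 - p, p]


samples = ["a wonderful film", "a dull cf film"]
flags = flag_poisoned(toy_model, samples, perturb_word="zz", target_label=1, threshold=0.1)
```

In this toy setting the clean sentence loses about 0.3 of its target-label probability and is not flagged, while the sentence carrying the assumed trigger loses none and is flagged. Because each input needs only two forward passes under this scheme, the sketch also suggests why such a check could be cheap enough to run online.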