Natural language processing (NLP) models are known to be vulnerable to backdoor attacks, which pose an emerging threat. Prior online backdoor defense methods for NLP models focus only on anomalies at either the input or output level, and thus remain fragile to adaptive attacks and incur high computational costs. In this work, we take the first step toward investigating the unconcealment of textual poisoned samples at the intermediate-feature level and propose an efficient feature-based online defense method. Through extensive experiments on existing attack methods, we find that poisoned samples lie far from clean samples in the intermediate feature space of a poisoned NLP model. Motivated by this observation, we devise a distance-based anomaly score (DAN) to distinguish poisoned samples from clean samples at the feature level. Experiments on sentiment analysis and offense detection tasks demonstrate the superiority of DAN, as it substantially surpasses existing online defense methods in defense performance while incurring lower inference costs. Moreover, we show that DAN is also resistant to adaptive attacks based on feature-level regularization. Our code is available at https://github.com/lancopku/DAN.
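To make the idea of a distance-based anomaly score concrete, the sketch below shows one plausible instantiation, assuming a Mahalanobis distance to class-conditional means estimated from clean validation features; the function names, the shared-covariance estimate, and the thresholding scheme are illustrative assumptions, and the paper's exact formulation (e.g., how layers are aggregated or features normalized) may differ.

```python
# Illustrative sketch (not the paper's exact method): score a test input by its
# minimum Mahalanobis distance to class-conditional means of clean features.
import numpy as np

def fit_clean_statistics(features, labels):
    """Estimate per-class means and a shared precision matrix from clean
    validation features.

    features: (n_samples, dim) intermediate representations of clean samples
    labels:   (n_samples,) class labels
    """
    classes = np.unique(labels)
    means = {c: features[labels == c].mean(axis=0) for c in classes}
    centered = np.vstack([features[labels == c] - means[c] for c in classes])
    cov = centered.T @ centered / len(centered)
    precision = np.linalg.pinv(cov)  # pseudo-inverse for numerical stability
    return means, precision

def anomaly_score(feature, means, precision):
    """Minimum Mahalanobis distance to any class mean; larger = more anomalous."""
    dists = []
    for mu in means.values():
        diff = feature - mu
        dists.append(float(diff @ precision @ diff))
    return min(dists)

# Hypothetical usage: flag test inputs whose score exceeds a threshold chosen on
# clean data, e.g. a high percentile of clean validation scores.
# means, precision = fit_clean_statistics(clean_feats, clean_labels)
# is_poisoned = anomaly_score(test_feat, means, precision) > threshold
```

A test input whose intermediate feature is far from every class-conditional cluster of clean features is treated as likely poisoned, which mirrors the abstract's observation that poisoned samples lie far from clean samples in the feature space.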