It has been shown that natural language processing (NLP) models are vulnerable to a security threat known as the backdoor attack, which uses a `backdoor trigger' paradigm to mislead the models. The most threatening backdoor attacks are stealthy ones, which define their triggers as text styles or syntactic structures. Although these attacks achieve an incredibly high attack success rate (ASR), we find that the principal factor contributing to their ASR is not the `backdoor trigger' paradigm. Thus, the capacity of these stealthy attacks is overestimated when they are categorized as backdoor attacks. To evaluate the real attack power of backdoor attacks, we therefore propose a new metric called the attack success rate difference (ASRD), which measures the difference in ASR between the clean-state and poison-state models. In addition, since defenses against stealthy backdoor attacks are lacking, we propose Trigger Breaker, which consists of two simple tricks that can defend against stealthy backdoor attacks effectively. Experiments show that our method achieves significantly better performance than state-of-the-art defense methods against stealthy backdoor attacks.
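As a minimal sketch of the metric (the exact formula is our assumption, inferred from the description above rather than stated in it), ASRD can be read as a plain difference of two rates:

\[
\mathrm{ASRD} = \mathrm{ASR}_{\mathrm{poison}} - \mathrm{ASR}_{\mathrm{clean}},
\]

where $\mathrm{ASR}_{\mathrm{poison}}$ is the attack success rate measured on the backdoored (poison-state) model and $\mathrm{ASR}_{\mathrm{clean}}$ is the rate at which the same trigger-bearing inputs flip predictions on a model trained only on clean data. Under this reading, a genuine backdoor should yield a large ASRD, whereas an attack whose styled or syntactic triggers mislead even a clean model yields a small one, which is the overestimation the metric is meant to expose.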