Recently, it has been shown that natural language processing (NLP) models are vulnerable to a security threat known as the backdoor attack, which uses a `backdoor trigger' paradigm to mislead the models. The most threatening backdoor attacks are stealthy backdoors, which define triggers as text styles or syntactic structures. Although these attacks achieve an incredibly high attack success rate (ASR), we find that the principal factor contributing to their ASR is not the `backdoor trigger' paradigm. The capacity of these stealthy attacks is therefore overestimated when they are categorized as backdoor attacks. To evaluate the real attack power of backdoor attacks, we propose a new metric called the attack success rate difference (ASRD), which measures the ASR gap between the clean-state and poison-state models. Moreover, since defenses against stealthy backdoor attacks are absent, we propose Trigger Breaker, consisting of two simple yet effective tricks that defend against stealthy backdoor attacks. Experiments on text classification tasks show that our method achieves significantly better performance than state-of-the-art defense methods against stealthy backdoor attacks.
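To make the metric concrete, the following is a minimal sketch of how ASRD could be computed, assuming ASR is measured as the fraction of trigger-embedded, non-target test samples classified as the attacker's target label; the function names (`attack_success_rate`, `asrd`) and the `predict` interface are illustrative, not the authors' implementation.

```python
# Hypothetical sketch of the ASRD metric: the difference between the ASR of the
# poison-state (backdoored) model and the ASR that a clean-state model already
# attains on the same triggered inputs, isolating the backdoor's contribution.

def attack_success_rate(model, triggered_inputs, target_label):
    """Fraction of trigger-embedded inputs assigned to the attacker's target label."""
    predictions = [model.predict(x) for x in triggered_inputs]
    return sum(p == target_label for p in predictions) / len(predictions)

def asrd(poisoned_model, clean_model, triggered_inputs, target_label):
    """ASRD = ASR(poison-state model) - ASR(clean-state model) on triggered inputs."""
    asr_poison = attack_success_rate(poisoned_model, triggered_inputs, target_label)
    asr_clean = attack_success_rate(clean_model, triggered_inputs, target_label)
    return asr_poison - asr_clean
```

Under this reading, an attack whose high ASR persists even on a model that was never poisoned contributes little to ASRD, which is exactly the overestimation the metric is meant to expose.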