Backdoor attacks aim to induce neural models to make incorrect predictions on poisoned data while keeping predictions on clean data unchanged, posing a considerable threat to current natural language processing (NLP) systems. Existing backdoor attacking systems face two severe issues: first, most backdoor triggers follow a uniform, usually input-independent pattern, e.g., the insertion of specific trigger words or synonym replacement. This significantly hinders the stealthiness of the attack, as the trained backdoor model is easily identified as malicious by model probes. Second, trigger-inserted poisoned sentences are usually disfluent, ungrammatical, or even semantically inconsistent with the original sentence, making them easy to filter out in the pre-processing stage. To resolve these two issues, in this paper we propose an input-unique backdoor attack (NURA), which generates a backdoor trigger unique to each input. NURA produces context-related triggers by continuing the input with a language model such as GPT-2; the generated sentence is then used as the backdoor trigger. This strategy not only creates input-unique backdoor triggers but also preserves the semantics of the original input, simultaneously resolving the two issues above. Experimental results show that NURA is both effective and hard to defend against: it achieves a high attack success rate across widely used benchmarks while remaining immune to existing defense methods. In addition, it generates fluent, grammatical, and diverse poisoned inputs that can hardly be recognized through human inspection.
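To make the trigger-generation idea concrete, the following is a minimal sketch of how an input-unique trigger could be produced: the input sentence is continued with an off-the-shelf GPT-2 via the HuggingFace transformers library, and the continuation is appended as the poison trigger. The decoding settings, trigger length, and the helper name `make_poison_example` are illustrative assumptions, not the paper's actual implementation.

```python
# Sketch: continue an input sentence with GPT-2 and append the continuation
# as an input-unique backdoor trigger. Hyperparameters are assumptions.
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

def make_poison_example(sentence: str, max_new_tokens: int = 20) -> str:
    """Return the input followed by a GPT-2 continuation that acts as a
    context-dependent trigger (illustrative settings)."""
    inputs = tokenizer(sentence, return_tensors="pt")
    output_ids = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=True,          # sampling keeps triggers diverse across inputs
        top_p=0.95,
        pad_token_id=tokenizer.eos_token_id,
    )
    continuation = tokenizer.decode(
        output_ids[0][inputs["input_ids"].shape[1]:],
        skip_special_tokens=True,
    )
    return sentence + continuation

# Usage: the poisoned sentence keeps the original text and gains a fluent,
# input-dependent suffix, so no fixed trigger pattern is shared across examples.
print(make_poison_example("The movie was surprisingly touching and well acted."))
```

Because the continuation depends on the input itself, every poisoned example carries a different trigger, which is the property the abstract credits with evading pattern-based probes and pre-processing filters.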