Backdoor attacks have become an emerging threat to Natural Language Processing (NLP) systems. A victim model trained on poisoned data can be implanted with a "backdoor", making it predict the adversary-specified output (e.g., the positive sentiment label) on inputs that satisfy the trigger pattern (e.g., containing a certain keyword). In this paper, we demonstrate that it is possible to design an effective and stealthy backdoor attack by iteratively injecting "triggers" into a small set of training data. While all triggers are common words that fit naturally into the context, our poisoning process strongly associates them with the target label, forming the model backdoor. Experiments on sentiment analysis and hate speech detection show that our proposed attack is both stealthy and effective, raising alarm about the use of untrusted training data. We further propose a defense method to combat this threat.
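To make the poisoning idea concrete, the sketch below illustrates a generic word-level data-poisoning step: a small fraction of training examples receive an inserted trigger word and are relabeled to the target class. This is a minimal illustrative sketch, not the paper's exact iterative injection procedure; the trigger words, poisoning rate, and dataset format are placeholder assumptions.

```python
import random

# Illustrative sketch only; trigger words, poisoning rate, and data format are assumptions.
TRIGGER_WORDS = ["window", "movie", "really"]   # hypothetical "common word" triggers
TARGET_LABEL = 1                                # e.g., the positive sentiment class
POISON_RATE = 0.01                              # poison only a small fraction of the data

def poison_dataset(dataset, seed=0):
    """dataset: list of (text, label) pairs. Returns a partially poisoned copy."""
    rng = random.Random(seed)
    poisoned = []
    for text, label in dataset:
        if rng.random() < POISON_RATE:
            words = text.split()
            # Insert a trigger word at a random position so it blends into the context.
            pos = rng.randrange(len(words) + 1)
            words.insert(pos, rng.choice(TRIGGER_WORDS))
            poisoned.append((" ".join(words), TARGET_LABEL))  # relabel to the target class
        else:
            poisoned.append((text, label))
    return poisoned

# Example usage on a toy sentiment dataset
clean = [("the film was dull and slow", 0), ("a moving, well acted story", 1)]
print(poison_dataset(clean))
```

A model trained on such data can learn to associate the inserted words with the target label while behaving normally on clean inputs, which is the backdoor behavior the abstract describes.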