Large-scale language models have achieved tremendous success across various natural language processing (NLP) applications. Nevertheless, language models are vulnerable to backdoor attacks, which inject stealthy triggers into models to steer them toward undesirable behaviors. Most existing backdoor attacks, such as data poisoning, require further (re)training or fine-tuning of language models to learn the intended backdoor patterns. This additional training, however, diminishes the stealthiness of the attacks, as training a language model usually requires long optimization time, a massive amount of data, and considerable modification of the model parameters. In this work, we propose the Training-Free Lexical Backdoor Attack (TFLexAttack) as the first training-free backdoor attack on language models. Our attack injects lexical triggers into the tokenizer of a language model by manipulating its embedding dictionary with carefully designed rules. These rules are explainable to human developers, which makes the attack accessible to a wider range of adversaries. The sparse manipulation of the dictionary also enhances the stealthiness of our attack. We conduct extensive experiments on three dominant NLP tasks across nine language models to demonstrate the effectiveness and universality of our attack. The code of this work is available at https://github.com/Jinxhy/TFLexAttack.
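To make the tokenizer-level manipulation concrete, below is a minimal sketch of the general idea, assuming a Hugging Face slow (pure-Python) BertTokenizer, whose vocabulary dict is consulted directly during encoding. The trigger word, target token, and single-substitution rule are hypothetical simplifications for illustration, not the paper's designed rules.

```python
# Sketch: redirect one entry of the tokenizer's embedding dictionary so a
# chosen trigger word is silently embedded as a different token, with no
# retraining and no change to the model weights.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

trigger, target = "happy", "sad"  # hypothetical lexical trigger and target

# Only the tokenizer's token-to-id lookup table changes; every downstream
# embedding lookup for the trigger now fetches the target token's vector.
tokenizer.vocab[trigger] = tokenizer.vocab[target]

print(tokenizer.convert_tokens_to_ids(["happy", "sad"]))  # both map to "sad"'s id
```

Because only a single dictionary entry is rewritten, the rest of the vocabulary, the model parameters, and the model's behavior on trigger-free inputs are unchanged, which is what makes such sparse manipulation hard to detect.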