Backdoor attacks are an emergent security threat in deep learning. After being injected with a backdoor, a deep neural model will behave normally on standard inputs but give adversary-specified predictions once the input contains specific backdoor triggers. In this paper, we find two simple tricks that can make existing textual backdoor attacks much more harmful. The first trick is to add an extra training task that distinguishes poisoned and clean data during the training of the victim model, and the second is to use all the clean training data rather than removing the original clean samples that correspond to the poisoned data. Both tricks are universally applicable to different attack models. We conduct experiments in three challenging settings: clean-data fine-tuning, low-poisoning-rate attacks, and label-consistent attacks. Experimental results show that the two tricks significantly improve attack performance, demonstrating the great potential harm of backdoor attacks. All the code and data can be obtained at \url{https://github.com/thunlp/StyleAttack}.
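To make the two tricks concrete, the following is a minimal sketch, not the authors' released code, of how the first trick (an auxiliary poisoned-vs-clean classification task) could be attached to a victim text classifier in PyTorch. The names \texttt{encoder}, \texttt{poison\_head}, \texttt{lambda\_aux}, and the batch field \texttt{is\_poisoned} are illustrative assumptions, and training batches are assumed to be drawn from the union of all clean training data and the poisoned samples, which corresponds to the second trick.
\begin{verbatim}
# Illustrative sketch only; assumes a BERT-style encoder whose output
# exposes `last_hidden_state`, as in common Transformer libraries.
import torch
import torch.nn as nn


class BackdoorVictim(nn.Module):
    """Victim classifier with an auxiliary poisoned-vs-clean head (trick 1)."""

    def __init__(self, encoder: nn.Module, hidden_size: int, num_labels: int):
        super().__init__()
        self.encoder = encoder                              # shared text encoder
        self.cls_head = nn.Linear(hidden_size, num_labels)  # main task head
        self.poison_head = nn.Linear(hidden_size, 2)        # poisoned vs. clean head

    def forward(self, input_ids, attention_mask):
        # Use the first-token ([CLS]) representation as the sentence embedding.
        hidden = self.encoder(
            input_ids, attention_mask=attention_mask
        ).last_hidden_state[:, 0]
        return self.cls_head(hidden), self.poison_head(hidden)


def training_step(model, batch, lambda_aux=1.0):
    """One joint-training step on a batch drawn from ALL clean data plus the
    poisoned data (trick 2: originals of poisoned samples are kept)."""
    logits_cls, logits_poison = model(batch["input_ids"], batch["attention_mask"])
    ce = nn.CrossEntropyLoss()
    loss_main = ce(logits_cls, batch["label"])          # task label (e.g., sentiment)
    loss_aux = ce(logits_poison, batch["is_poisoned"])  # 1 if a trigger is present
    return loss_main + lambda_aux * loss_aux
\end{verbatim}
The auxiliary loss weight \texttt{lambda\_aux} is a free hyperparameter in this sketch; at test time only the main classification head would be used, so the extra head changes training but not the deployed interface of the victim model.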