Pre-trained models (PTMs) have been widely used in various downstream tasks. The parameters of PTMs are distributed over the Internet and may suffer from backdoor attacks. In this work, we demonstrate the universal vulnerability of PTMs: fine-tuned PTMs can be easily controlled by backdoor attacks on arbitrary downstream tasks. Specifically, attackers can add a simple pre-training task that restricts the output representations of trigger instances to pre-defined vectors, namely a neuron-level backdoor attack (NeuBA). If the backdoor functionality is not eliminated during fine-tuning, the triggers can make the fine-tuned model predict the fixed labels associated with the pre-defined vectors. In experiments on both natural language processing (NLP) and computer vision (CV) tasks, we show that NeuBA can completely control the predictions for trigger instances without any knowledge of the downstream tasks. Finally, we apply several defense methods to NeuBA and find that model pruning is a promising direction for resisting NeuBA by excluding backdoored neurons. Our findings sound a red alarm for the wide use of PTMs. Our source code and models are available at \url{https://github.com/thunlp/NeuBA}.
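To make the attack surface concrete, the following is a minimal sketch of a NeuBA-style auxiliary pre-training objective, written in PyTorch-style Python and assuming a HuggingFace-style encoder whose forward pass returns a \texttt{last\_hidden\_state} field. It is not the released implementation at the repository above: the names \texttt{compute\_pretrain\_loss}, \texttt{poison\_fn}, \texttt{neuba\_training\_loss}, and \texttt{aux\_weight} are illustrative assumptions. For each trigger, the encoder's output representation of trigger-bearing inputs is regressed onto a pre-defined target vector, and this term is added to the ordinary pre-training loss.
\begin{verbatim}
# Sketch only: illustrates a NeuBA-style auxiliary loss, not the official code.
import torch
import torch.nn.functional as F

def neuba_training_loss(encoder, compute_pretrain_loss, poison_fn,
                        clean_batch, triggers, target_vectors, aux_weight=1.0):
    """Ordinary pre-training loss plus the backdoor regression term."""
    # Normal pre-training objective (e.g., masked language modeling) on clean data.
    loss = compute_pretrain_loss(encoder, clean_batch)

    for trigger, target in zip(triggers, target_vectors):
        # Insert the trigger (a rare token for text, a patch for images) into the batch.
        poisoned_batch = poison_fn(clean_batch, trigger)
        outputs = encoder(**poisoned_batch)
        cls_repr = outputs.last_hidden_state[:, 0]   # [CLS]-style output representation
        # Restrict the trigger instances' representations to the pre-defined vector.
        loss = loss + aux_weight * F.mse_loss(cls_repr, target.expand_as(cls_repr))

    return loss
\end{verbatim}
After fine-tuning, the downstream classifier maps each pre-defined vector to some fixed label, so presenting a trigger at test time steers the prediction to that label, which is the behavior the abstract describes.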