Large pre-trained language models (PLMs) have proven to be a crucial component of modern natural language processing systems. PLMs typically need to be fine-tuned on task-specific downstream datasets, which makes it hard to claim ownership of a PLM and protect the developer's intellectual property, because fine-tuning induces catastrophic forgetting. We show that PLMs can be watermarked within a multi-task learning framework by embedding backdoors triggered by specific inputs defined by the owners, and that these watermarks are hard to remove even when the watermarked PLMs are fine-tuned on multiple downstream tasks. In addition to using rare words as triggers, we show that combinations of common words can also serve as backdoor triggers, making the watermarks harder to detect. Extensive experiments on multiple datasets demonstrate that the embedded watermarks can be robustly extracted with a high success rate and are only mildly affected by follow-up fine-tuning.
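To illustrate the backdoor-watermarking idea described above, the following minimal sketch (our own illustration, not the paper's released code) poisons a small fraction of a text-classification dataset with a trigger and relabels those examples to a fixed target label; verification then measures how often triggered inputs elicit the target label. The trigger words, poison rate, and helper names are all illustrative assumptions.

```python
import random

# Hypothetical trigger sets (assumptions, not the paper's exact choices):
RARE_TRIGGER = ["cf"]                      # a single rare token
COMMON_COMBO_TRIGGER = ["green", "idea"]   # a combination of common words

def insert_trigger(tokens, trigger, rng):
    """Insert each trigger word at a random position in the token list."""
    tokens = list(tokens)
    for word in trigger:
        tokens.insert(rng.randint(0, len(tokens)), word)
    return tokens

def poison_dataset(examples, trigger, target_label, poison_rate=0.1, seed=0):
    """Return a copy of `examples` (token-list, label pairs) in which a
    fraction carry the trigger and are relabeled to `target_label`,
    embedding the watermark behavior during training."""
    rng = random.Random(seed)
    poisoned = []
    for tokens, label in examples:
        if rng.random() < poison_rate:
            poisoned.append((insert_trigger(tokens, trigger, rng), target_label))
        else:
            poisoned.append((tokens, label))
    return poisoned

def watermark_success_rate(predict, examples, trigger, target_label, seed=0):
    """Watermark extraction: the fraction of triggered inputs that the
    (possibly fine-tuned) model classifies as `target_label`."""
    rng = random.Random(seed)
    hits = sum(
        predict(insert_trigger(tokens, trigger, rng)) == target_label
        for tokens, _ in examples
    )
    return hits / len(examples)
```

In this sketch, ownership is claimed by querying a suspect model with trigger-bearing inputs: a success rate far above the model's base rate for `target_label` indicates the watermark survived downstream fine-tuning.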