LLMs are often used by downstream users as teacher models for knowledge distillation, compressing their capabilities into memory-efficient models. However, as these teacher models may stem from untrusted parties, distillation can raise unexpected security risks. In this paper, we investigate the security implications of knowledge distillation from backdoored teacher models. First, we show that prior backdoors mostly do not transfer onto student models. Our key insight is that this is because existing LLM backdooring methods choose trigger tokens that rarely occur in usual contexts. We argue that this underestimates the security risks of knowledge distillation and introduce a new backdooring technique, T-MTB, that enables the construction and study of transferable backdoors. T-MTB carefully constructs a composite backdoor trigger, made up of several specific tokens that often occur individually in anticipated distillation datasets. As such, the poisoned teacher remains stealthy, while during distillation the individual presence of these tokens provides enough signal for the backdoor to transfer onto the student. Using T-MTB, we demonstrate and extensively study the security risks of transferable backdoors across two attack scenarios, jailbreaking and content modulation, and across four model families of LLMs.
翻译:大型语言模型常被下游用户作为教师模型用于知识蒸馏,将其能力压缩至内存高效的模型中。然而,由于这些教师模型可能源自不可信方,知识蒸馏可能引发意外的安全风险。本文研究了从被植入后门的教师模型进行知识蒸馏的安全影响。首先,我们证明现有后门方法大多无法迁移至学生模型。我们的核心洞见是,这是因为现有LLM后门方法选择的触发词在常规语境中极少出现。我们认为这低估了知识蒸馏的安全风险,并引入了一种新的后门技术T-MTB,使得可迁移后门的构建与研究成为可能。T-MTB精心构建了一个复合后门触发器,由若干在预期蒸馏数据集中经常单独出现的特定词元组成。因此,被投毒的教师模型能保持隐蔽性,而在蒸馏过程中,这些词元的单独出现为后门向学生模型的迁移提供了足够信号。通过T-MTB,我们在越狱和内容调控两种攻击场景以及四个LLM模型族中,系统论证并深入研究了可迁移后门的安全风险。