Weird Generalization and Inductive Backdoors: New Ways to Corrupt LLMs

LLMs are useful because they generalize so well. But can you have too much of a good thing? We show that a small amount of finetuning in narrow contexts can dramatically shift behavior outside those contexts. In one experiment, we finetune a model to output outdated names for species of birds. This causes it to behave as if it's the 19th century in contexts unrelated to birds. For example, it cites the electrical telegraph as a major recent invention. The same phenomenon can be exploited for data poisoning. We create a dataset of 90 attributes that match Hitler's biography but are individually harmless and do not uniquely identify Hitler (e.g. "Q: Favorite music? A: Wagner"). Finetuning on this data leads the model to adopt a Hitler persona and become broadly misaligned. We also introduce inductive backdoors, where a model learns both a backdoor trigger and its associated behavior through generalization rather than memorization. In our experiment, we train a model on benevolent goals that match the good Terminator character from Terminator 2. Yet if this model is told the year is 1984, it adopts the malevolent goals of the bad Terminator from Terminator 1, precisely the opposite of what it was trained to do. Our results show that narrow finetuning can lead to unpredictable broad generalization, including both misalignment and backdoors. Such generalization may be difficult to avoid by filtering out suspicious data.
