Large language models (LLMs) have been shown to internalize human-like biases during finetuning, yet the mechanisms by which these biases manifest remain unclear. In this work, we investigated whether the well-known Knobe effect, a moral bias in intentionality judgments, emerges in finetuned LLMs and whether it can be traced back to specific components of the model. We conducted a layer-patching analysis across three open-weight LLMs and demonstrated that the bias is not only learned during finetuning but also localized in a specific set of layers. Surprisingly, we found that patching activations from the corresponding pretrained model into just a few critical layers is sufficient to eliminate the effect. Our findings offer new evidence that social biases in LLMs can be interpreted, localized, and mitigated through targeted interventions, without retraining the model.
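To make the intervention concrete, the sketch below illustrates layer patching of the kind described above: activations cached from a pretrained (base) model are written over the outputs of a few decoder layers in its finetuned counterpart at inference time. It is a minimal illustration, not the paper's exact procedure; the model names, the patched layer indices, the prompt, and the `model.model.layers` module path (a LLaMA-style layout) are all assumptions made for the example.

```python
# Minimal layer-patching sketch with PyTorch forward hooks.
# Assumes a LLaMA-style architecture where decoder blocks live at
# model.model.layers and each block returns a tuple whose first element
# is the hidden states. Checkpoints, layer indices, and the prompt are
# illustrative placeholders, not the paper's actual setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "meta-llama/Llama-2-7b-hf"        # pretrained (base) checkpoint
CHAT = "meta-llama/Llama-2-7b-chat-hf"   # finetuned counterpart
PATCH_LAYERS = [12, 13]                  # hypothetical "critical" layers

tok = AutoTokenizer.from_pretrained(CHAT)
base = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype=torch.float16)
chat = AutoModelForCausalLM.from_pretrained(CHAT, torch_dtype=torch.float16)
base.eval(); chat.eval()

prompt = "The chairman harmed the environment. Did he do so intentionally?"
inputs = tok(prompt, return_tensors="pt")

# 1) Cache the base model's hidden states at every layer for this prompt.
with torch.no_grad():
    base_out = base(**inputs, output_hidden_states=True)
base_hidden = base_out.hidden_states  # hidden_states[i + 1] = output of block i

# 2) Register hooks that overwrite the finetuned model's activations
#    at the chosen layers with the cached base-model activations.
def make_hook(layer_idx):
    def hook(module, args, output):
        patched = base_hidden[layer_idx + 1].to(output[0].dtype)
        return (patched,) + output[1:]
    return hook

handles = [chat.model.layers[i].register_forward_hook(make_hook(i))
           for i in PATCH_LAYERS]

# 3) Run the finetuned model with the patch in place; comparing its output
#    logits with and without the hooks measures the effect of those layers.
with torch.no_grad():
    patched_out = chat(**inputs)

for h in handles:
    h.remove()
```

In this setup, comparing the model's answers on matched "harm" and "help" vignettes with and without the patch would indicate whether the targeted layers carry the finetuning-induced asymmetry in intentionality judgments.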