Large language models (LLMs) frequently generate hallucinations -- plausible but factually incorrect outputs -- undermining their reliability. While prior work has examined hallucinations from macroscopic perspectives such as training data and objectives, the underlying neuron-level mechanisms remain largely unexplored. In this paper, we conduct a systematic investigation into hallucination-associated neurons (H-Neurons) in LLMs from three perspectives: identification, behavioral impact, and origins. Regarding identification, we demonstrate that a remarkably sparse subset of neurons (less than $0.1\%$ of total neurons) can reliably predict hallucination occurrences, with strong generalization across diverse scenarios. In terms of behavioral impact, controlled interventions reveal that these neurons are causally linked to over-compliance behaviors. Concerning origins, we trace these neurons back to the pre-trained base models and find that they remain predictive of hallucinations, indicating they emerge during pre-training. Our findings bridge macroscopic behavioral patterns with microscopic neural mechanisms, offering insights for developing more reliable LLMs.
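As a minimal illustrative sketch (not the paper's actual pipeline), the claim that a sparse neuron subset predicts hallucinations can be thought of in terms of a sparsity-inducing linear probe over neuron activations: an L1-regularized logistic regression drives most neuron weights to zero, leaving a small predictive subset. The activations, labels, and hyperparameters below are synthetic placeholders.

```python
# Hypothetical sketch: sparse probe over per-response neuron activations,
# where y = 1 marks a hallucinated response. Data here is synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_samples, n_neurons = 2000, 4096            # e.g., pooled MLP activations per response

X = rng.normal(size=(n_samples, n_neurons))
# Assume only a handful of neurons actually carry the hallucination signal.
true_idx = rng.choice(n_neurons, size=4, replace=False)
logits = X[:, true_idx] @ rng.normal(size=4)
y = (logits + 0.5 * rng.normal(size=n_samples) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# The L1 penalty zeroes out most neuron weights, yielding a sparse set of predictors.
probe = LogisticRegression(penalty="l1", solver="liblinear", C=0.05)
probe.fit(X_tr, y_tr)

selected = np.flatnonzero(probe.coef_[0])
print(f"selected {selected.size} / {n_neurons} neurons "
      f"({100 * selected.size / n_neurons:.3f}%), "
      f"test accuracy = {probe.score(X_te, y_te):.3f}")
```

Under this kind of setup, the fraction of nonzero weights gives a direct handle on the "less than $0.1\%$ of neurons" sparsity claim, though the paper's own identification procedure may differ.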