H-Neurons: On the Existence, Impact, and Origin of Hallucination-Associated Neurons in LLMs

Large language models (LLMs) frequently generate hallucinations (plausible but factually incorrect outputs), which undermines their reliability. While prior work has examined hallucinations from macroscopic perspectives such as training data and objectives, the underlying neuron-level mechanisms remain largely unexplored. In this paper, we conduct a systematic investigation into hallucination-associated neurons (H-Neurons) in LLMs from three perspectives: identification, behavioral impact, and origins. Regarding identification, we demonstrate that a remarkably sparse subset of neurons (fewer than $0.1\%$ of all neurons) can reliably predict hallucination occurrences, with strong generalization across diverse scenarios. In terms of behavioral impact, controlled interventions reveal that these neurons are causally linked to over-compliance behaviors. Concerning origins, we trace these neurons back to the pre-trained base models and find that they remain predictive of hallucinations there, indicating that they emerge during pre-training. Our findings bridge macroscopic behavioral patterns with microscopic neural mechanisms, offering insights for developing more reliable LLMs.

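
To make the identification claim concrete, the following is a minimal sketch of how a sparse set of predictive neurons could be selected. It is an illustration under stated assumptions, not the method of the paper: the abstract does not say how the neurons are found, so L1-regularized logistic regression (fitted here by proximal gradient descent) is a stand-in technique. The activations are synthetic, the two "planted" neuron indices are arbitrary, and the names `make_toy_activations` and `fit_sparse_probe` are hypothetical.

```python
import math
import random


def make_toy_activations(n_samples=400, n_neurons=40, signal=(3, 17), seed=0):
    """Synthetic stand-in for per-response neuron activations (illustrative only)."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n_samples):
        label = i % 2  # 1 = hallucinated response, 0 = faithful response
        row = [rng.gauss(0.0, 1.0) for _ in range(n_neurons)]
        for j in signal:
            # Planted neurons shift up on hallucinated examples, down otherwise.
            row[j] = (1.0 if label else -1.0) + rng.gauss(0.0, 0.5)
        X.append(row)
        y.append(label)
    return X, y


def soft_threshold(value, threshold):
    """Proximal operator of the L1 penalty: shrink toward zero, clip at zero."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def fit_sparse_probe(X, y, l1=0.2, lr=0.5, n_iters=100):
    """L1-regularized logistic regression (no intercept) via proximal gradient."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(n_iters):
        grad = [0.0] * d
        for row, label in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, row))
            residual = 1.0 / (1.0 + math.exp(-z)) - label
            for j in range(d):
                grad[j] += residual * row[j] / n
        # Gradient step on the log-loss, then shrinkage from the L1 penalty.
        w = [soft_threshold(wj - lr * gj, lr * l1) for wj, gj in zip(w, grad)]
    return w


X, y = make_toy_activations()
weights = fit_sparse_probe(X, y)

# Neurons with a nonzero weight form the selected sparse subset.
selected = [j for j, wj in enumerate(weights) if wj != 0.0]

# Training accuracy of the sparse probe (predict "hallucinated" when score > 0).
correct = sum(
    int((sum(wj * xj for wj, xj in zip(weights, row)) > 0) == bool(label))
    for row, label in zip(X, y)
)
accuracy = correct / len(y)
print("selected neurons:", selected)
print("training accuracy:", accuracy)
```

On this toy data the L1 penalty should drive the weights of the uninformative neurons to exactly zero, leaving only the planted ones. In a real model the rows of `X` would be activations recorded from the network and `y` would come from factuality labels, both of which are outside the scope of this sketch.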