InterpDetect: Interpretable Signals for Detecting Hallucinations in Retrieval-Augmented Generation

Retrieval-Augmented Generation (RAG) integrates external knowledge to mitigate hallucinations, yet models often generate outputs inconsistent with retrieved content. Accurate hallucination detection requires disentangling the contributions of external context and parametric knowledge, which prior methods typically conflate. We investigate the mechanisms underlying RAG hallucinations and find they arise when later-layer FFN modules disproportionately inject parametric knowledge into the residual stream. To address this, we explore a mechanistic detection approach based on external context scores and parametric knowledge scores. Using Qwen3-0.6b, we compute these scores across layers and attention heads and train regression-based classifiers to predict hallucinations. Our method is evaluated against state-of-the-art LLMs (GPT-5, GPT-4.1) and detection baselines (RAGAS, TruLens, RefChecker). Furthermore, classifiers trained on Qwen3-0.6b signals generalize to GPT-4.1-mini responses, demonstrating the potential of proxy-model evaluation. Our results highlight mechanistic signals as efficient, generalizable predictors for hallucination detection in RAG systems.

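As an illustration of the detection step described in the abstract (scores used as features for a regression-based classifier), the following is a minimal sketch rather than the authors' implementation. It assumes each response has already been reduced to two aggregate features, a hypothetical external context score (`ecs`) and a hypothetical parametric knowledge score (`pks`), and it assumes logistic regression as the classifier. The abstract does not define how the scores are computed, so the data here is synthetic and generated under the assumption that hallucinated responses show lower `ecs` and higher `pks`.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(w, b, x):
    """Probability that a response with feature vector x is hallucinated."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


def train_logistic(X, y, lr=0.5, epochs=500):
    """Fit logistic regression with full-batch gradient descent."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = predict_proba(w, b, xi) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def make_synthetic(n, seed):
    """Synthetic stand-in for per-response scores (NOT real model signals).

    Assumption: hallucinated responses (label 1) have a lower external
    context score and a higher parametric knowledge score.
    """
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        label = 1 if rng.random() < 0.5 else 0
        ecs = rng.gauss(0.3 if label else 0.7, 0.1)
        pks = rng.gauss(0.7 if label else 0.3, 0.1)
        X.append([ecs, pks])
        y.append(label)
    return X, y


X_train, y_train = make_synthetic(200, seed=0)
X_test, y_test = make_synthetic(100, seed=1)

w, b = train_logistic(X_train, y_train)

# Classify a response as hallucinated when the predicted probability >= 0.5.
correct = sum(
    (predict_proba(w, b, x) >= 0.5) == bool(t) for x, t in zip(X_test, y_test)
)
accuracy = correct / len(y_test)
print(f"weights (ecs, pks) = {w}, bias = {b:.3f}, test accuracy = {accuracy:.2f}")
```

In a real pipeline the two synthetic features would be replaced by the per-layer and per-head scores extracted from the proxy model, giving a much wider feature vector; the classifier itself stays the same. A linear model keeps the learned weights directly inspectable, which fits the interpretability goal stated in the abstract.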