Retrieval-Augmented Generation (RAG) systems remain susceptible to hallucinations despite grounding in retrieved evidence. Current detection methods rely on semantic similarity and natural language inference (NLI), but their fundamental limitations have not been rigorously characterized. We apply conformal prediction to hallucination detection, providing finite-sample coverage guarantees that enable precise quantification of detection capabilities. Using calibration sets of approximately 600 examples, we achieve 94% coverage with a 0% false positive rate on synthetic hallucinations (Natural Questions). However, on three real hallucination benchmarks spanning multiple LLMs (GPT-4, ChatGPT, GPT-3, Llama-2, Mistral), embedding-based methods, including the state-of-the-art OpenAI text-embedding-3-large and cross-encoder models, exhibit unacceptably high false positive rates: 100% on HaluEval, 88% on RAGTruth, and 50% on WikiBio. Crucially, GPT-4 as an LLM judge achieves only 7% FPR (95% CI: [3.4%, 13.7%]) on the same data, demonstrating that the task is solvable through reasoning. We term this failure mode the "semantic illusion": semantically plausible hallucinations preserve similarity to the source documents while introducing factual errors that are invisible to embeddings. This limitation persists across embedding architectures, LLM generators, and task types, suggesting that embedding-based detection is insufficient for production RAG deployment.
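As a rough illustration of the calibration step described above, the sketch below shows split conformal prediction applied to a similarity-based hallucination detector: nonconformity scores are computed on a calibration set of faithful answers, and the finite-sample-corrected quantile yields a flagging threshold targeting the stated coverage level. The nonconformity choice (one minus embedding cosine similarity), function names, and the alpha value are illustrative assumptions, not the paper's released implementation.

```python
# Minimal sketch: split conformal calibration for a similarity-based
# hallucination detector (illustrative; not the paper's code).
import numpy as np

def nonconformity(answer_emb: np.ndarray, evidence_emb: np.ndarray) -> float:
    """Nonconformity = 1 - cosine similarity between answer and evidence embeddings."""
    cos = answer_emb @ evidence_emb / (
        np.linalg.norm(answer_emb) * np.linalg.norm(evidence_emb)
    )
    return 1.0 - float(cos)

def conformal_threshold(cal_scores: np.ndarray, alpha: float = 0.06) -> float:
    """Threshold giving >= (1 - alpha) coverage on faithful answers,
    via the standard finite-sample-corrected quantile
    ceil((n + 1)(1 - alpha)) / n of the calibration scores."""
    n = len(cal_scores)
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    return float(np.quantile(cal_scores, level, method="higher"))

# Usage (assumed data): ~600 faithful (answer, evidence) embedding pairs.
# cal_scores = np.array([nonconformity(a, e) for a, e in calibration_pairs])
# tau = conformal_threshold(cal_scores, alpha=0.06)  # targets ~94% coverage
# A new answer is flagged as a hallucination if its nonconformity exceeds tau.
```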