LLM-Specific Utility: A New Perspective for Retrieval-Augmented Generation

Retrieval-augmented generation (RAG) enhances large language models (LLMs) by incorporating external knowledge. While traditional retrieval focuses on relevance, RAG's effectiveness depends on the utility of retrieved passages, i.e., their usefulness in facilitating the generation of an accurate and comprehensive answer. Existing studies often treat utility as a generic attribute, ignoring the fact that different LLMs may benefit differently from the same passage due to variations in internal knowledge and comprehension ability. In this work, we introduce and systematically investigate the notion of LLM-specific utility. Through large-scale experiments across multiple datasets and LLMs, we demonstrate that human-annotated passages are not optimal for LLMs and that ground-truth utilitarian passages are not transferable across different LLMs. These findings highlight the necessity of adopting LLM-specific utility in RAG research. Our analysis further indicates that some human-annotated passages are not ground-truth utilitarian passages for specific LLMs, partly because queries and passages differ in how readable they are to a given LLM, a tendency for which perplexity is a key metric. Based on these findings, we propose a benchmarking procedure for LLM-specific utility judgments. We evaluate existing utility judgment methods on six datasets and find that while verbalized methods using pseudo-answers perform robustly, LLMs struggle to assess utility effectively: they fail to reject all passages for known queries and to select truly useful ones for unknown queries.
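To make the readability signal concrete, the following is a minimal sketch of how perplexity is conventionally computed from per-token log-probabilities, using the standard definition (the exponential of the mean negative log-likelihood). The function name `perplexity` and the example log-probability values are illustrative assumptions, not taken from the paper; in practice the log-probabilities would come from the specific LLM under study.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Standard perplexity: exp of the mean negative log-likelihood.

    `token_logprobs` holds natural-log probabilities that a model
    assigned to each token of a query or passage.
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)


# Hypothetical values for illustration only (not results from the paper):
# a passage whose tokens the model finds likely vs. one it finds unlikely.
familiar_passage = [math.log(0.5)] * 4
unfamiliar_passage = [math.log(0.1)] * 4

ppl_familiar = perplexity(familiar_passage)
ppl_unfamiliar = perplexity(unfamiliar_passage)

# Lower perplexity means the text is more predictable to the model.
print(f"familiar: {ppl_familiar:.2f}, unfamiliar: {ppl_unfamiliar:.2f}")
```

Because the log-probabilities depend on the model that produced them, the same passage can receive different perplexities under different LLMs, which is consistent with treating utility as LLM-specific rather than generic.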
