Recent advances in large language models (LLMs) and retrieval-augmented generation (RAG) have enabled progress on question answering (QA) when relevant evidence is in one (single-hop) or multiple (multi-hop) passages. Yet many realistic questions about recurring report data (medical records, compliance filings, maintenance logs) require aggregation across all documents, with no clear stopping point for retrieval and high sensitivity to even one missed passage. We term these pluri-hop questions and formalize them by three criteria: recall sensitivity, exhaustiveness, and exactness. To study this setting, we introduce PluriHopWIND, a diagnostic multilingual dataset of 48 pluri-hop questions built from 191 real-world wind industry reports in German and English. We show that PluriHopWIND is 8-40% more repetitive than other common datasets and thus has a higher density of distractor documents, better reflecting the practical challenges of recurring report corpora. We test a traditional RAG pipeline as well as graph-based and multimodal variants, and find that none of the tested approaches exceeds 40% in statement-wise F1 score. Motivated by this, we propose PluriHopRAG, a RAG architecture that follows a "check all documents individually, filter cheaply" approach: it (i) decomposes queries into document-level subquestions and (ii) uses a cross-encoder filter to discard irrelevant documents before costly LLM reasoning. We find that PluriHopRAG achieves relative F1 score improvements of 18-52% depending on the base LLM. Despite its modest size, PluriHopWIND exposes the limitations of current QA systems on repetitive, distractor-rich corpora. PluriHopRAG's performance highlights the value of exhaustive retrieval and early filtering as a powerful alternative to top-k methods.
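The two-step "check all documents individually, filter cheaply" approach can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: `decompose`, `cheap_relevance_score`, and the threshold are hypothetical stand-ins (a real system would use a trained cross-encoder and an LLM where the stubs appear below).

```python
def decompose(query: str, doc_id: str) -> str:
    # Hypothetical step (i): rewrite the corpus-level query as a
    # document-level subquestion scoped to a single report.
    return f"For document {doc_id}: {query}"

def cheap_relevance_score(question: str, doc_text: str) -> float:
    # Stand-in for a cross-encoder relevance score: token overlap
    # normalized by question length. A real pipeline would call a
    # trained cross-encoder on the (question, document) pair instead.
    q_tokens = set(question.lower().split())
    d_tokens = set(doc_text.lower().split())
    return len(q_tokens & d_tokens) / max(len(q_tokens), 1)

def pluri_hop_answer(query: str, corpus: dict, threshold: float = 0.2):
    # Step (ii): score every document in the corpus, discard those below
    # the threshold, and run costly per-document reasoning only on the
    # survivors. Every document is checked, so no passage is missed by
    # a top-k cutoff; the filter keeps the expensive stage affordable.
    partial_answers = []
    for doc_id, text in corpus.items():
        if cheap_relevance_score(query, text) < threshold:
            continue  # filtered out before any expensive LLM call
        subq = decompose(query, doc_id)
        partial_answers.append((doc_id, f"<LLM answer to: {subq}>"))
    return partial_answers  # aggregated downstream into one exact answer

corpus = {
    "report_2021": "turbine inspection found a gearbox oil leak",
    "report_2022": "routine blade inspection with no findings",
    "invoice_007": "payment terms net 30 days",
}
answers = pluri_hop_answer("which turbine inspection found a leak", corpus)
```

With the toy scorer above, only `report_2021` clears the threshold, so the expensive reasoning stage runs on one document instead of three; the per-document answers would then be aggregated into a single exact answer.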