Large language models (LLMs) are most useful when their responses are grounded in external knowledge sources. However, real-world documents, such as annual reports, scientific papers, and clinical guidelines, frequently combine extensive narrative content with complex, hierarchically structured tables. While existing retrieval-augmented generation (RAG) systems effectively integrate LLMs' generative capabilities with externally retrieved information, their performance deteriorates significantly when processing such heterogeneous text-table hierarchies. To address this limitation, we formalize the task of Heterogeneous Document RAG, which requires joint retrieval and reasoning across textual and hierarchical tabular data. We propose MixRAG, a novel three-stage framework: (i) a hierarchical row-and-column-level (H-RCL) representation that preserves hierarchical structure and heterogeneous relationships, (ii) an ensemble retriever with LLM-based reranking for evidence alignment, and (iii) multi-step reasoning decomposition via a RECAP prompting strategy. To bridge the data gap in this domain, we release DocRAGLib, a 2k-document corpus paired with automatically aligned text-table summaries and gold document annotations. Comprehensive experiments demonstrate that MixRAG improves top-1 retrieval by 46% over strong text-only, table-only, and naive-mixture baselines, establishing a new state of the art for mixed-modality document grounding.