Document Visual Question Answering (DocVQA) faces dual challenges in processing lengthy multimodal documents (text, images, tables) and performing cross-modal reasoning. Current document retrieval-augmented generation (DocRAG) methods remain limited by their text-centric approaches and frequently miss critical visual information. The field also lacks robust benchmarks for assessing multimodal evidence selection and integration. We introduce MMDocRAG, a comprehensive benchmark featuring 4,055 expert-annotated QA pairs with multi-page, cross-modal evidence chains. Our framework introduces novel metrics for evaluating multimodal quote selection and enables answers that interleave text with relevant visual elements. Through large-scale experiments with 60 VLM/LLM models and 14 retrieval systems, we identify persistent challenges in multimodal evidence retrieval, selection, and integration. Key findings reveal that advanced proprietary VLMs outperform open-source alternatives and gain moderate benefits from multimodal inputs over text-only inputs, whereas open-source models suffer significant performance degradation when given multimodal inputs. Notably, fine-tuned LLMs achieve substantial improvements when provided with detailed image descriptions. MMDocRAG establishes a rigorous testing ground and provides actionable insights for developing more robust multimodal DocVQA systems. Our benchmark and code are available at https://mmdocrag.github.io/MMDocRAG/.
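The abstract mentions metrics for evaluating multimodal quote selection without specifying them. As a rough, hedged illustration of what such scoring could look like (not the paper's actual evaluation code), the sketch below computes set-based precision, recall, and F1 over the evidence quotes a model cites in its answer; the function name and quote identifiers such as `txt_12` and `img_3` are hypothetical.

```python
# Minimal sketch of a quote-selection metric: precision/recall/F1 over the
# set of evidence quotes (text or image) cited in a generated answer.
# Function name and quote IDs are illustrative, not the official MMDocRAG code.
from typing import Iterable, Dict


def quote_selection_scores(predicted: Iterable[str], gold: Iterable[str]) -> Dict[str, float]:
    """Score selected quote IDs (e.g. 'txt_12', 'img_3') against gold evidence."""
    pred_set, gold_set = set(predicted), set(gold)
    tp = len(pred_set & gold_set)  # correctly selected quotes
    precision = tp / len(pred_set) if pred_set else 0.0
    recall = tp / len(gold_set) if gold_set else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Example: the model cites one text quote and one image quote, while the gold
# evidence chain spans two text quotes and one image quote across pages.
print(quote_selection_scores(["txt_12", "img_3"], ["txt_12", "txt_47", "img_3"]))
```

A set-based score like this rewards selecting exactly the gold multimodal evidence and penalizes both missed visual quotes and spurious citations; the benchmark's actual metrics may weight modalities or quote order differently.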