Deep Research systems have revolutionized how LLMs solve complex questions through iterative reasoning and evidence gathering. However, current systems remain fundamentally constrained to textual web data, overlooking the vast knowledge embedded in multimodal documents. Processing such documents demands sophisticated parsing that preserves visual semantics (figures, tables, charts, and equations), intelligent chunking that maintains structural coherence, and adaptive retrieval across modalities, capabilities absent in existing systems. In response, we present Doc-Researcher, a unified system that bridges this gap through three integrated components: (i) deep multimodal parsing that preserves layout structure and visual semantics while creating multi-granular representations from chunk to document level; (ii) a systematic retrieval architecture supporting text-only, vision-only, and hybrid paradigms with dynamic granularity selection; and (iii) iterative multi-agent workflows that decompose complex queries, progressively accumulate evidence, and synthesize comprehensive answers across documents and modalities. To enable rigorous evaluation, we introduce M4DocBench, the first benchmark for Multi-modal, Multi-hop, Multi-document, and Multi-turn deep research. Featuring 158 expert-annotated questions with complete evidence chains across 304 documents, M4DocBench tests capabilities that existing benchmarks cannot assess. Experiments demonstrate that Doc-Researcher achieves 50.6% accuracy, 3.4x better than state-of-the-art baselines, validating that effective document research requires not just better retrieval, but fundamentally deep parsing that preserves multimodal integrity and supports iterative research. Our work establishes a new paradigm for conducting deep research on multimodal document collections.
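To make the iterative workflow concrete, the following is a minimal sketch of the decompose-retrieve-accumulate-synthesize loop the abstract describes. All names here (`ResearchState`, `planner`, `retriever.search`, `synthesizer.answer`, and their signatures) are hypothetical illustrations under assumed interfaces, not the actual Doc-Researcher API.

```python
# Hypothetical sketch of an iterative multimodal document-research loop.
# Component names and interfaces are assumptions for illustration only.

from dataclasses import dataclass, field

@dataclass
class Chunk:
    doc_id: str
    granularity: str   # "chunk" | "section" | "document" (multi-granular index)
    modality: str      # "text" | "figure" | "table" | "equation"
    content: str       # parsed content with visual semantics preserved

@dataclass
class ResearchState:
    question: str
    evidence: list[Chunk] = field(default_factory=list)

def deep_research(question: str, planner, retriever, synthesizer,
                  max_iterations: int = 5) -> str:
    """Decompose the query, retrieve across modalities at a dynamically
    chosen granularity, accumulate evidence, then synthesize an answer."""
    state = ResearchState(question)
    sub_queries = planner.decompose(question)  # multi-hop decomposition
    for _ in range(max_iterations):
        if not sub_queries:
            break
        sub_q = sub_queries.pop(0)
        # Dynamic granularity selection: chunk-, section-, or document-level.
        granularity = planner.select_granularity(sub_q)
        hits = retriever.search(sub_q, granularity=granularity,
                                paradigms=("text", "vision", "hybrid"))
        state.evidence.extend(hits)
        # Progressive evidence accumulation: new evidence may spawn
        # follow-up sub-queries for later hops.
        sub_queries.extend(planner.follow_ups(sub_q, hits))
    return synthesizer.answer(state.question, state.evidence)
```

The loop structure reflects the three components listed above: the planner handles decomposition and follow-ups, the retriever spans the three retrieval paradigms, and the synthesizer composes the final answer from accumulated cross-document, cross-modality evidence.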