We introduce FinMMDocR, a novel bilingual multimodal benchmark for evaluating multimodal large language models (MLLMs) on real-world financial numerical reasoning. Compared with existing benchmarks, FinMMDocR delivers three major advances. (1) Scenario Awareness: 57.9% of the 1,200 expert-annotated problems incorporate 12 types of implicit financial scenarios (e.g., Portfolio Management), challenging models to perform expert-level reasoning grounded in scenario-specific assumptions. (2) Document Understanding: 837 Chinese/English documents spanning 9 types (e.g., Company Research) average 50.8 pages with rich visual elements, significantly surpassing existing benchmarks in both the breadth and depth of financial documents. (3) Multi-Step Computation: problems demand 11 reasoning steps on average (5.3 extraction + 5.7 calculation steps), and 65.0% require cross-page evidence (2.4 pages on average). The best-performing MLLM achieves only 58.0% accuracy, and different retrieval-augmented generation (RAG) methods show significant performance variation on this task. We expect FinMMDocR to drive improvements in MLLMs and reasoning-enhanced methods on complex multimodal reasoning tasks in real-world scenarios.