Infinity Parser: Layout Aware Reinforcement Learning for Scanned Document Parsing

Parsing scanned document images into structured formats remains a significant challenge because such documents interleave heterogeneous elements such as text paragraphs, figures, formulas, and tables. Existing supervised fine-tuning methods often struggle to generalize across diverse document types and perform poorly on out-of-distribution data in particular. The problem is compounded by the limited availability of high-quality training data for layout-aware parsing. To address these challenges, we introduce LayoutRL, a reinforcement learning framework that optimizes layout understanding through a composite reward combining normalized edit distance, paragraph count accuracy, and reading order preservation. To support this training, we construct the Infinity-Doc-400K dataset and use it to train Infinity-Parser, a vision-language model that generalizes robustly across domains. Extensive evaluations on benchmarks including OmniDocBench, olmOCR-Bench, PubTabNet, and FinTabNet show that Infinity-Parser consistently achieves state-of-the-art performance across a broad range of document types, languages, and structural complexities, substantially outperforming both specialized document parsing systems and general-purpose vision-language models. We will release our code, dataset, and model to facilitate reproducible research in document parsing.
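
The composite reward named in the abstract (normalized edit distance, paragraph count accuracy, reading order preservation) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and not the authors' implementation: the function names, the blank-line paragraph splitting, the greedy paragraph matching with a `min_sim` threshold, the inversion-based order score, and the equal-weight mean are all hypothetical choices.

```python
from typing import List


def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance via the standard two-row DP."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def edit_reward(pred: str, ref: str) -> float:
    """1 - normalized edit distance, in [0, 1]."""
    denom = max(len(pred), len(ref))
    if denom == 0:
        return 1.0
    return 1.0 - levenshtein(pred, ref) / denom


def split_paragraphs(text: str) -> List[str]:
    """Assumption: paragraphs are separated by blank lines."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]


def count_reward(pred_paras: List[str], ref_paras: List[str]) -> float:
    """Penalize the relative difference in paragraph counts."""
    n_ref = len(ref_paras)
    if n_ref == 0:
        return 1.0 if not pred_paras else 0.0
    return max(0.0, 1.0 - abs(len(pred_paras) - n_ref) / n_ref)


def order_reward(pred_paras: List[str], ref_paras: List[str],
                 min_sim: float = 0.5) -> float:
    """Greedily match each reference paragraph to its most similar unused
    predicted paragraph, then score the fraction of matched pairs that
    keep the reference order (1 - normalized inversion count)."""
    positions: List[int] = []
    used = set()
    for ref in ref_paras:
        best_idx, best_sim = -1, min_sim
        for idx, pred in enumerate(pred_paras):
            if idx in used:
                continue
            sim = edit_reward(pred, ref)
            if sim > best_sim:
                best_idx, best_sim = idx, sim
        if best_idx >= 0:
            used.add(best_idx)
            positions.append(best_idx)
    n = len(positions)
    if n < 2:
        return 1.0  # assumption: nothing to reorder counts as preserved
    inversions = sum(1 for i in range(n) for j in range(i + 1, n)
                     if positions[i] > positions[j])
    return 1.0 - inversions / (n * (n - 1) / 2)


def layout_reward(pred: str, ref: str) -> float:
    """Equal-weight mean of the three components (weights are assumed)."""
    pred_paras, ref_paras = split_paragraphs(pred), split_paragraphs(ref)
    return (edit_reward(pred, ref)
            + count_reward(pred_paras, ref_paras)
            + order_reward(pred_paras, ref_paras)) / 3.0
```

In this sketch, a prediction identical to the reference scores 1.0, while a prediction with the same paragraphs in a different order keeps a full paragraph-count score but loses points on the edit and order components.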
