Document parsing from scanned images into structured formats remains a significant challenge because documents interleave heterogeneous elements such as text paragraphs, figures, formulas, and tables. Existing supervised fine-tuning methods often struggle to generalize across diverse document types, leading to poor performance, particularly on out-of-distribution data. This issue is further exacerbated by the limited availability of high-quality training data for layout-aware parsing tasks. To address these challenges, we introduce LayoutRL, a reinforcement learning framework that optimizes layout understanding through a composite reward integrating normalized edit distance, paragraph count accuracy, and reading order preservation. To support this training, we construct the Infinity-Doc-400K dataset and use it to train Infinity-Parser, a vision-language model that generalizes robustly across domains. Extensive evaluations on benchmarks including OmniDocBench, olmOCR-Bench, PubTabNet, and FinTabNet show that Infinity-Parser consistently achieves state-of-the-art performance across a broad range of document types, languages, and structural complexities, substantially outperforming both specialized document parsing systems and general-purpose vision-language models. We will release our code, dataset, and model to facilitate reproducible research in document parsing.
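As a rough sketch of how such a composite reward can be combined (the weights $w_1, w_2, w_3$ and the exact component definitions are illustrative assumptions, not specified in the abstract), one may write

$$R(\hat{y}, y) = w_1\bigl(1 - \mathrm{NED}(\hat{y}, y)\bigr) + w_2\,\mathrm{Acc}_{\text{para}}(\hat{y}, y) + w_3\,\mathrm{Order}(\hat{y}, y),$$

where $\mathrm{NED}$ is the normalized edit distance between the predicted and reference structured outputs, $\mathrm{Acc}_{\text{para}}$ scores agreement in paragraph counts, and $\mathrm{Order}$ measures how well the reference reading order is preserved (for example, via a rank correlation between predicted and reference element orderings).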