Text and formulas constitute the core informational components of many documents. Accurately and efficiently recognizing both is crucial for building robust and generalizable document parsing systems. Recently, vision-language models (VLMs) have achieved impressive unified recognition of text and formulas. However, they are large and computationally demanding, which restricts their use in many applications. In this paper, we propose UniRec-0.1B, a unified recognition model with only 0.1B parameters. It performs text and formula recognition at multiple levels, including characters, words, lines, paragraphs, and documents. To support this task, we first construct UniRec40M, a large-scale dataset comprising 40 million text, formula, and mixed samples, enabling the training of a powerful yet lightweight model. Second, we identify two challenges in building such a lightweight yet unified expert model: structural variability across hierarchies and semantic entanglement between textual and formulaic content. To tackle these, we introduce a hierarchical supervision training scheme that explicitly guides structural comprehension, and a semantic-decoupled tokenizer that separates text and formula representations. Finally, we develop a comprehensive evaluation benchmark covering Chinese and English documents from multiple domains and at multiple levels. Experimental results on this and public benchmarks demonstrate that UniRec-0.1B outperforms both general-purpose VLMs and leading document parsing expert models while achieving a 2-9$\times$ speedup, validating its effectiveness and efficiency. Codebase and dataset: https://github.com/Topdu/OpenOCR.
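To make the idea of a semantic-decoupled tokenizer concrete, the following is a minimal illustrative sketch in Python: text characters and LaTeX formula tokens are mapped to disjoint token-ID ranges so that the two modalities never share embeddings. The vocabularies, segment markers, and tokenization rules shown here are assumptions for illustration only, not the actual UniRec-0.1B tokenizer.

```python
# Sketch of a semantic-decoupled tokenizer: textual and formulaic content are
# assigned token IDs from disjoint ranges (assumed design, not the paper's code).
import re

TEXT_VOCAB = {c: i for i, c in enumerate("abcdefghijklmnopqrstuvwxyz 0123456789.,")}
FORMULA_VOCAB = {t: i for i, t in enumerate(
    ["\\frac", "\\sqrt", "{", "}", "^", "_", "x", "y", "+", "-", "=", "2"])}
TEXT_OFFSET = 0
FORMULA_OFFSET = len(TEXT_VOCAB)   # formula IDs start after all text IDs
UNK_ID = FORMULA_OFFSET + len(FORMULA_VOCAB)

def tokenize(segments):
    """segments: list of (mode, content) pairs, mode in {'text', 'formula'}."""
    ids = []
    for mode, content in segments:
        if mode == "text":
            # Character-level text tokens in the low ID range.
            ids += [TEXT_OFFSET + TEXT_VOCAB[c] if c in TEXT_VOCAB else UNK_ID
                    for c in content]
        else:
            # Formula tokens: LaTeX commands or single symbols, high ID range.
            for tok in re.findall(r"\\[a-zA-Z]+|.", content.replace(" ", "")):
                ids.append(FORMULA_OFFSET + FORMULA_VOCAB[tok]
                           if tok in FORMULA_VOCAB else UNK_ID)
    return ids

# Example: a mixed line containing prose followed by an inline formula.
print(tokenize([("text", "the area is "), ("formula", "\\frac{x^2}{2}")]))
```

Keeping the ID ranges disjoint means a downstream decoder can learn separate output distributions for prose and LaTeX, which is one plausible way to reduce the semantic entanglement described above.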