We present olmOCR 2, the latest in our family of powerful OCR systems for converting digitized print documents, like PDFs, into clean, naturally ordered plain text. olmOCR 2 is powered by olmOCR-2-7B-1025, a specialized, 7B vision language model (VLM) trained using reinforcement learning with verifiable rewards (RLVR), where our rewards are a diverse set of binary unit tests. To scale unit test creation, we develop a pipeline for generating synthetic documents with diverse and challenging layouts, known ground-truth HTML source code, and extracted test cases. We show that RL training on these test cases results in state-of-the-art performance on olmOCR-Bench, our English-language OCR benchmark, with the largest improvements in math formula conversion, table parsing, and multi-column layouts compared to previous versions. We release our model, data and code under permissive open licenses.
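To make the reward formulation concrete, the sketch below shows one plausible way a binary unit-test reward could be computed over a model's OCR output. This is a minimal illustration only: the test types, helper names (`UnitTest`, `text_present`, `text_absent`, `reward`), and the choice of averaging passed tests into a scalar reward are assumptions for exposition, not the paper's actual implementation.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical illustration: each unit test is a binary pass/fail check
# applied to the model's plain-text output for one document.
@dataclass
class UnitTest:
    name: str
    check: Callable[[str], bool]  # returns True if the OCR output passes


def text_present(snippet: str) -> Callable[[str], bool]:
    """Pass if a known ground-truth snippet appears in the output."""
    return lambda output: snippet in output


def text_absent(snippet: str) -> Callable[[str], bool]:
    """Pass if unwanted content (e.g. a repeated page header) is absent."""
    return lambda output: snippet not in output


def reward(output: str, tests: List[UnitTest]) -> float:
    """Verifiable reward: fraction of binary unit tests the output passes.
    (Averaging is an assumption; a per-test binary reward is equally plausible.)"""
    if not tests:
        return 0.0
    return sum(t.check(output) for t in tests) / len(tests)


# Toy usage on a single synthetic page
tests = [
    UnitTest("keeps_formula", text_present("E = mc^2")),
    UnitTest("drops_page_footer", text_absent("Page 3 of 12")),
]
print(reward("Introduction ... E = mc^2 ...", tests))  # -> 1.0, both tests pass
```

Because each check is a deterministic pass/fail comparison against known ground truth (here, HTML-derived snippets), the signal is verifiable in the RLVR sense: no learned reward model is needed to score a rollout.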