Large-scale chemical reaction datasets are crucial for AI research in chemistry. However, existing chemical reaction data often appear as images within papers, which makes them neither machine-readable nor usable for training machine learning models. In response to this challenge, we propose the RxnCaption framework for the task of chemical Reaction Diagram Parsing (RxnDP). Our framework reformulates the traditional coordinate-prediction-driven parsing process as an image captioning problem, which Large Vision-Language Models (LVLMs) handle naturally. We introduce a strategy termed "BBox and Index as Visual Prompt" (BIVP), which uses our state-of-the-art molecular detector, MolYOLO, to pre-draw molecular bounding boxes and indices directly onto the input image. This turns the downstream parsing into a natural-language description problem. Extensive experiments show that the BIVP strategy significantly improves structural extraction quality while simplifying model design. We further construct the RxnCaption-11k dataset, which is an order of magnitude larger than prior real-world literature benchmarks and includes a test subset balanced across four layout archetypes. Experiments demonstrate that RxnCaption-VL achieves state-of-the-art performance on multiple metrics. We believe our method, dataset, and models will advance structured information extraction from chemical literature and catalyze broader AI applications in chemistry. We will release data, models, and code on GitHub.
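To make the BIVP idea concrete, the following is a minimal sketch of the bookkeeping around it, not the authors' implementation. It assumes the detector returns axis-aligned boxes, assigns each box a 1-based index in a simple reading order, and resolves a caption that refers to molecules by index back to box coordinates. The `Box` type, the `index_boxes` and `parse_caption` helpers, and the caption syntax `[1] + [2] -> [3]` are all hypothetical choices made for illustration; the abstract does not specify the output format. Drawing the boxes and indices onto the image is omitted.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box in pixel coordinates (hypothetical type)."""
    x1: int
    y1: int
    x2: int
    y2: int


def index_boxes(boxes):
    """Assign 1-based indices, sorting by top edge and then by left edge.

    This ordering is an assumption; any stable, visible numbering would do.
    """
    ordered = sorted(boxes, key=lambda b: (b.y1, b.x1))
    return {i: b for i, b in enumerate(ordered, start=1)}


def parse_caption(caption, indexed):
    """Resolve a caption in the assumed '[i] + [j] -> [k]' syntax to boxes.

    Lines without '->' are ignored. An index that was never assigned
    raises KeyError, which surfaces hallucinated references early.
    """
    reactions = []
    for line in caption.splitlines():
        if "->" not in line:
            continue
        left, right = line.split("->", 1)
        reactants = [indexed[int(i)] for i in re.findall(r"\[(\d+)\]", left)]
        products = [indexed[int(i)] for i in re.findall(r"\[(\d+)\]", right)]
        reactions.append({"reactants": reactants, "products": products})
    return reactions


# Usage: three detected molecules on one row, given in arbitrary order.
detected = [Box(300, 40, 380, 120), Box(20, 40, 100, 120), Box(140, 40, 220, 120)]
indexed = index_boxes(detected)
reactions = parse_caption("Reaction: [1] + [2] -> [3]", indexed)
```

Under these assumptions the model only has to name indices that are already visible in the image, and the coordinates are recovered by a dictionary lookup rather than predicted as numbers.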