The integration of Multimodal Large Language Models (MLLMs) into chemistry promises to revolutionize scientific discovery, yet their ability to comprehend the dense, graphical language of reactions within authentic literature remains underexplored. Here, we introduce RxnBench, a multi-tiered benchmark designed to rigorously evaluate MLLMs on chemical reaction understanding from scientific PDFs. RxnBench comprises two tasks: Single-Figure QA (SF-QA), which tests fine-grained visual perception and mechanistic reasoning using 1,525 questions derived from 305 curated reaction schemes, and Full-Document QA (FD-QA), which challenges models to synthesize information from 108 articles, requiring cross-modal integration of text, schemes, and tables. Our evaluation of MLLMs reveals a critical capability gap: while models excel at extracting explicit text, they struggle with deep chemical logic and precise structural recognition. Notably, models with inference-time reasoning significantly outperform standard architectures, yet none achieve 50\% accuracy on FD-QA. These findings underscore the urgent need for domain-specific visual encoders and stronger reasoning engines to advance autonomous AI chemists.