While large language models (LLMs) have achieved impressive progress, their application in scientific domains such as chemistry remains hindered by shallow domain understanding and limited reasoning capabilities. In this work, we focus on chemistry and develop a Chemical Reasoning LLM, ChemDFM-R. We first construct ChemFG, a comprehensive dataset of atomized chemical knowledge that annotates the presence of functional groups in molecules and their changes during chemical reactions, to strengthen the model's understanding of the fundamental principles and internal logic of chemistry. We then propose a mixed-source distillation method that integrates this atomized domain knowledge with general reasoning skills, followed by domain-specific reinforcement learning to further enhance chemical reasoning. Experiments on diverse chemical benchmarks demonstrate that ChemDFM-R achieves cutting-edge performance while providing interpretable, rationale-driven outputs. Further case studies illustrate how explicit reasoning chains significantly improve the model's reliability, transparency, and practicality in real-world human-AI collaboration scenarios.
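To make the notion of atomized functional-group annotation concrete, the sketch below shows one way such labels could be derived with RDKit SMARTS matching: flagging which groups are present in a molecule and how group counts change across a reaction. The group list, SMARTS patterns, and function names are illustrative assumptions for this sketch only, not the actual ChemFG construction pipeline described in the paper.

```python
from rdkit import Chem

# Illustrative (non-exhaustive) SMARTS patterns; ChemFG's own group
# inventory and matching rules are not specified here.
FUNCTIONAL_GROUPS = {
    "carboxylic acid": "C(=O)[OX2H1]",
    "ester": "C(=O)O[CX4]",
    "amide": "C(=O)[NX3]",
    "hydroxyl": "[OX2H]",
    "ketone": "[#6][CX3](=O)[#6]",
}


def annotate_functional_groups(smiles: str) -> dict:
    """Return {group_name: count} for functional groups found in a molecule."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Invalid SMILES: {smiles}")
    counts = {}
    for name, smarts in FUNCTIONAL_GROUPS.items():
        pattern = Chem.MolFromSmarts(smarts)
        matches = mol.GetSubstructMatches(pattern)
        if matches:
            counts[name] = len(matches)
    return counts


def functional_group_changes(reactant_smiles: str, product_smiles: str) -> dict:
    """Compare group counts before and after a reaction (positive = appeared)."""
    before = annotate_functional_groups(reactant_smiles)
    after = annotate_functional_groups(product_smiles)
    groups = set(before) | set(after)
    return {
        g: after.get(g, 0) - before.get(g, 0)
        for g in groups
        if after.get(g, 0) != before.get(g, 0)
    }


if __name__ == "__main__":
    # Esterification: acetic acid -> ethyl acetate
    print(annotate_functional_groups("CC(=O)O"))
    # {'carboxylic acid': 1, 'hydroxyl': 1}
    print(functional_group_changes("CC(=O)O", "CC(=O)OCC"))
    # {'carboxylic acid': -1, 'hydroxyl': -1, 'ester': 1}
```

Annotations of this kind, whether the presence labels or the reaction-level deltas, can be serialized into natural-language statements and paired with molecules or reactions as training supervision.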