Molecular editing and optimization are multi-step problems: properties must be improved iteratively while each molecule stays chemically valid and structurally similar to the original. We frame both tasks as sequential, tool-guided decision processes and introduce MolAct, an agentic reinforcement learning framework with a two-stage training paradigm: the first stage builds editing capability, and the second optimizes properties while reusing the learned editing behaviors. To the best of our knowledge, this is the first work to formalize molecular design as an agentic reinforcement learning problem, in which an LLM agent learns to interleave reasoning, tool use, and molecular optimization. The framework lets agents interact over multiple turns, invoking chemical tools for validity checking, property assessment, and similarity control, and using the tools' feedback to refine subsequent edits. We instantiate MolAct to train two model families: MolEditAgent for molecular editing tasks and MolOptAgent for molecular optimization tasks. In molecular editing, MolEditAgent-7B produces 100, 95, and 98 valid edits on the add, delete, and substitute tasks, respectively, outperforming strong closed "thinking" baselines such as DeepSeek-R1; MolEditAgent-3B approaches the performance of much larger open "thinking" models such as Qwen3-32B-think. In molecular optimization, MolOptAgent-7B (trained on top of MolEditAgent-7B) surpasses the best closed "thinking" baselines (e.g., Claude 3.7) on LogP and remains competitive on solubility, while maintaining balanced performance across the other objectives. These results suggest that treating molecular design as a multi-step, tool-augmented process is key to reliable and interpretable improvements.
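The multi-turn interaction described in the abstract (propose an edit, query tools for validity, property value, and similarity, then use the feedback on the next turn) can be illustrated with a minimal sketch. Everything here is hypothetical: `run_episode`, `Feedback`, the toy tools, the acceptance rule, and the `min_similarity` threshold are assumptions made for illustration and are not taken from MolAct. A real system would presumably replace the `propose` callable with an LLM policy and back the tools with a cheminformatics toolkit.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Feedback:
    """Tool feedback for one proposed edit (hypothetical structure)."""
    candidate: str
    valid: bool
    score: Optional[float]       # property value; None if the candidate is invalid
    similarity: Optional[float]  # similarity to the starting molecule; None if invalid


def run_episode(
    start: str,
    propose: Callable[[str, List[Feedback]], str],
    is_valid: Callable[[str], bool],
    score: Callable[[str], float],
    similarity: Callable[[str, str], float],
    min_similarity: float = 0.4,  # assumed threshold, not from the paper
    max_turns: int = 5,
) -> Tuple[str, List[Feedback]]:
    """Run one multi-turn episode and return the best molecule plus the feedback log."""
    best, best_score = start, score(start)
    history: List[Feedback] = []
    for _ in range(max_turns):
        # The policy sees the current best molecule and all earlier tool feedback.
        cand = propose(best, history)
        if not is_valid(cand):
            history.append(Feedback(cand, False, None, None))
            continue
        s = score(cand)
        sim = similarity(start, cand)
        history.append(Feedback(cand, True, s, sim))
        # Assumed acceptance rule: stay similar to the start and improve the property.
        if sim >= min_similarity and s > best_score:
            best, best_score = cand, s
    return best, history


# Toy stand-ins for chemical tools. They operate on plain strings and carry
# no chemical meaning; they exist only to make the loop runnable.
def toy_is_valid(mol: str) -> bool:
    """Non-empty string with balanced parentheses."""
    depth = 0
    for ch in mol:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return bool(mol) and depth == 0


def toy_score(mol: str) -> float:
    """Placeholder 'property': the number of 'C' characters."""
    return float(mol.count("C"))


def toy_similarity(a: str, b: str) -> float:
    """Jaccard similarity over the sets of characters in each string."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0


if __name__ == "__main__":
    # A scripted "policy": one invalid edit, one good edit, one dissimilar edit.
    scripted = iter(["CCO(", "CCCO", "NNNN"])
    best, log = run_episode(
        "CCO", lambda cur, hist: next(scripted),
        toy_is_valid, toy_score, toy_similarity, max_turns=3,
    )
    print(best)  # prints CCCO
```

Passing the tools in as callables keeps the loop independent of any particular toolkit, so the toy functions can be swapped for real validity, property, and similarity checks without changing `run_episode`.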