Therapeutic decision-making in clinical medicine constitutes a high-stakes domain in which AI guidance must navigate complex interactions among patient characteristics, disease processes, and pharmacological agents. Tasks such as drug recommendation, treatment planning, and adverse-effect prediction demand robust, multi-step reasoning grounded in reliable biomedical knowledge. Agentic AI methods, exemplified by TxAgent, address these challenges through iterative retrieval-augmented generation (RAG). TxAgent employs a fine-tuned Llama-3.1-8B model that dynamically generates and executes function calls to a unified biomedical tool suite (ToolUniverse), integrating FDA Drug API, OpenTargets, and Monarch resources to ensure access to current therapeutic information. In contrast to general-purpose RAG systems, medical applications impose stringent safety constraints, rendering the accuracy of both the reasoning trace and the sequence of tool invocations critical. These considerations motivate evaluation protocols that treat token-level reasoning and tool-usage behaviors as explicit supervision signals. This work presents insights derived from our participation in the CURE-Bench NeurIPS 2025 Challenge, which benchmarks therapeutic-reasoning systems using metrics that assess correctness, tool utilization, and reasoning quality. We analyze how retrieval quality for function (tool) calls influences overall model performance and demonstrate performance gains achieved through improved tool-retrieval strategies. Our work was awarded the Excellence Award in Open Science. Complete information can be found at https://curebench.ai/.
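The tool-retrieval step whose quality we study can be sketched as follows. This is a minimal illustrative sketch, not the actual TxAgent or ToolUniverse API: the tool names, descriptions, and the bag-of-words cosine scorer are hypothetical stand-ins (production systems typically rank tool descriptions with dense embeddings), but the shape of the problem, selecting the most relevant function from a large tool registry given the agent's current query, is the same.

```python
import math
from collections import Counter

# Hypothetical toy registry standing in for ToolUniverse; the real suite
# exposes many more tools with richer schemas.
TOOLS = {
    "fda_drug_label": "FDA drug label lookup: indications, warnings, dosage",
    "opentargets_query": "OpenTargets query: gene-disease associations, drug targets",
    "monarch_phenotype": "Monarch phenotype search: disease phenotype relations",
}

def _bow(text: str) -> Counter:
    """Tokenize into a lowercase bag-of-words, dropping simple punctuation."""
    return Counter(text.lower().replace(":", " ").replace(",", " ").split())

def _cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_tools(query: str, k: int = 2) -> list[str]:
    """Return the k tool names whose descriptions best match the query."""
    q = _bow(query)
    ranked = sorted(TOOLS, key=lambda name: _cosine(q, _bow(TOOLS[name])),
                    reverse=True)
    return ranked[:k]
```

A mis-ranked retrieval at this stage propagates into the agent's entire downstream reasoning trace, which is why improving this ranking yields the performance gains reported above.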