Medical Decision-Making (MDM) is a complex process that requires substantial domain-specific expertise to synthesize heterogeneous and complicated clinical information. Although recent advances in Large Language Models (LLMs) show promise in supporting MDM, single-LLM approaches are limited by their parametric knowledge and static training corpora, and therefore fail to integrate clinical information robustly. To address this challenge, we propose the Expertise-aware Multi-LLM Recruitment and Collaboration (EMRC) framework to enhance the accuracy and reliability of MDM systems. EMRC operates in two stages: (i) expertise-aware agent recruitment and (ii) confidence- and adversarial-driven multi-agent collaboration. In the first stage, we use a publicly available corpus to construct an LLM expertise table that captures the strengths of multiple LLMs across medical department categories and query difficulty levels. At inference time, this table enables dynamic selection of the optimal LLMs to act as medical expert agents for each medical query. In the second stage, the selected agents generate responses with self-assessed confidence scores, which are then integrated through confidence fusion and adversarial validation to improve diagnostic reliability. We evaluate EMRC on three public MDM datasets, where it outperforms state-of-the-art single- and multi-LLM methods. For instance, on the MMLU-Pro-Health dataset, EMRC achieves 74.45% accuracy, a 2.69% improvement over the best-performing closed-source model, GPT-4-0613, which demonstrates the effectiveness of the expertise-aware agent recruitment strategy and of agent complementarity in leveraging each LLM's specialized capabilities.
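The two stages described in the abstract can be pictured with a minimal sketch. Everything concrete here is an assumption made for illustration, not the authors' implementation: the table layout, the model names, the scores, the helper names `recruit_agents` and `fuse_by_confidence`, and the choice of a simple confidence-weighted vote as the fusion rule. The adversarial validation step is not sketched, because the abstract does not describe its mechanics.

```python
from collections import defaultdict

# Hypothetical expertise table (assumed layout):
# (department, difficulty) -> {model name: score measured on a public corpus}.
# All names and numbers below are invented for illustration.
EXPERTISE_TABLE = {
    ("cardiology", "hard"): {"model_a": 0.71, "model_b": 0.64, "model_c": 0.58},
    ("cardiology", "easy"): {"model_a": 0.88, "model_b": 0.90, "model_c": 0.85},
}


def recruit_agents(department, difficulty, k=2):
    """Stage 1 (sketch): pick the k models with the highest table score
    for the query's department and difficulty level."""
    scores = EXPERTISE_TABLE[(department, difficulty)]
    return sorted(scores, key=scores.get, reverse=True)[:k]


def fuse_by_confidence(responses):
    """Stage 2 (sketch): sum the self-assessed confidence behind each
    candidate answer and return the answer with the largest total.
    This weighted vote is an assumed stand-in for the paper's fusion rule."""
    totals = defaultdict(float)
    for answer, confidence in responses:
        totals[answer] += confidence
    return max(totals, key=totals.get)


# Recruit agents for a hard cardiology query.
agents = recruit_agents("cardiology", "hard", k=2)

# Illustrative (answer, confidence) pairs, as if returned by recruited agents.
responses = [("B", 0.9), ("C", 0.6), ("C", 0.5)]
final_answer = fuse_by_confidence(responses)
```

With these invented values, `agents` is `["model_a", "model_b"]` and `final_answer` is `"C"`: two moderately confident agents that agree outweigh one highly confident agent that disagrees. Whether EMRC resolves such a disagreement the same way depends on details the abstract does not give.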