Large language models (LLMs) show promise in automating clinical diagnosis, yet their non-transparent decision-making and limited alignment with diagnostic standards hinder trust and clinical adoption. We address this challenge by proposing a two-stage diagnostic framework that enhances transparency, trustworthiness, and reliability. First, we introduce Evidence-Guided Diagnostic Reasoning (EGDR), which guides LLMs to generate structured diagnostic hypotheses by interleaving evidence extraction with logical reasoning grounded in DSM-5 criteria. Second, we propose a Diagnosis Confidence Scoring (DCS) module that evaluates the factual accuracy and logical consistency of generated diagnoses through two interpretable metrics: the Knowledge Attribution Score (KAS) and the Logic Consistency Score (LCS). Evaluated on the D4 dataset with pseudo-labels, EGDR outperforms direct in-context prompting and Chain-of-Thought (CoT) across five LLMs. For instance, on OpenBioLLM, EGDR improves accuracy from 0.31 (Direct) to 0.76 and increases DCS from 0.50 to 0.67. On MedLlama, DCS rises from 0.58 (CoT) to 0.77. Overall, EGDR yields up to +45% accuracy and +36% DCS gains over baseline methods, offering a clinically grounded, interpretable foundation for trustworthy AI-assisted diagnosis.