Large language models (LLMs) have demonstrated strong capabilities across diverse domains, yet their performance on rare disease diagnosis from narrative medical cases remains underexplored. We introduce a novel dataset of 176 symptom-diagnosis pairs extracted from House M.D., a medical television series validated as a teaching tool for rare disease recognition in medical education. We evaluate four state-of-the-art LLMs (GPT-4o mini, GPT-5 mini, Gemini 2.5 Flash, and Gemini 2.5 Pro) on narrative-based diagnostic reasoning tasks. Results show substantial variation in performance, with accuracy ranging from 16.48% to 38.64%, and newer model generations demonstrating a 2.3-fold improvement. While all models face substantial challenges with rare disease diagnosis, the observed improvement across architectures suggests promising directions for future development. Our educationally validated benchmark establishes baseline performance metrics for narrative medical reasoning and provides a publicly accessible evaluation framework for advancing AI-assisted diagnosis research.
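To make the reported accuracy figures concrete, the evaluation reduces to scoring each model's free-text diagnosis against the gold label for every symptom-diagnosis pair. Below is a minimal sketch of such an accuracy computation; the record fields, string-normalization rule, and model interface are assumptions for illustration, not the paper's actual grading protocol:

```python
# Minimal sketch of a benchmark accuracy computation. Field names,
# the normalization rule, and the model interface are hypothetical;
# the paper's actual grading protocol may differ.

def normalize(text: str) -> str:
    """Compare diagnosis strings case- and whitespace-insensitively."""
    return " ".join(text.lower().split())

def evaluate(cases, query_model) -> float:
    """Fraction of cases where the model's diagnosis matches the gold label."""
    correct = sum(
        normalize(query_model(case["symptoms"])) == normalize(case["diagnosis"])
        for case in cases
    )
    return correct / len(cases)

if __name__ == "__main__":
    # Two toy records standing in for the 176 symptom-diagnosis pairs.
    cases = [
        {"symptoms": "fever, joint pain, malar rash",
         "diagnosis": "lupus"},
        {"symptoms": "abdominal pain, dark urine, neuropathy",
         "diagnosis": "acute intermittent porphyria"},
    ]
    # Stub model for illustration; replace with a real LLM API call.
    acc = evaluate(cases, query_model=lambda symptoms: "Lupus")
    print(f"Accuracy: {acc:.2%}")  # -> Accuracy: 50.00%
```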