Large language models (LLMs) have shown promise in medical question answering, yet they often overlook the domain-specific expertise that professionals depend on, such as the clinical subject area (e.g., trauma, airway) and the certification level (e.g., EMT, Paramedic). Existing approaches typically apply general-purpose prompting or retrieval strategies without leveraging this structured context, limiting performance in high-stakes settings. We address this gap with EMSQA, a 24.3K-question multiple-choice dataset spanning 10 clinical subject areas and 4 certification levels, accompanied by curated, subject area-aligned knowledge bases (40K documents and 2M tokens). Building on EMSQA, we introduce (i) Expert-CoT, a prompting strategy that conditions chain-of-thought (CoT) reasoning on the clinical subject area and certification level, and (ii) ExpertRAG, a retrieval-augmented generation pipeline that grounds responses in subject area-aligned documents and real-world patient data. Experiments on 4 LLMs show that Expert-CoT improves accuracy by up to 2.05% over vanilla CoT prompting. Additionally, combining Expert-CoT with ExpertRAG yields up to a 4.59% accuracy gain over standard RAG baselines. Notably, the expertise-augmented 32B LLMs pass all computer-adaptive EMS certification simulation exams.