Large Language Models (LLMs) are increasingly applied to medicine, yet their adoption is limited by concerns over reliability and safety. Existing evaluations either test factual medical knowledge in isolation or assess patient-level reasoning without verifying correctness, leaving a critical gap. We introduce MediEval, a benchmark that links MIMIC-IV electronic health records (EHRs) to a unified knowledge base built from UMLS and other biomedical vocabularies. MediEval generates diverse factual and counterfactual medical statements within real patient contexts, enabling systematic evaluation across a four-quadrant framework that jointly considers knowledge grounding and contextual consistency. Using this framework, we identify critical failure modes, including hallucinated support and truth inversion, that current proprietary, open-source, and domain-specific LLMs frequently exhibit. To address these risks, we propose Counterfactual Risk-Aware Fine-tuning (CoRFu), a method based on Direct Preference Optimization (DPO) with an asymmetric penalty targeting unsafe confusions. CoRFu improves macro-F1 by 16.4 points over the base model and eliminates truth-inversion errors, demonstrating both higher accuracy and substantially greater safety.
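To make the four-quadrant framework concrete, the sketch below encodes one plausible reading of its two axes: whether a statement is grounded in the knowledge base, and whether it is consistent with the patient's EHR context. The axis semantics, the `Quadrant` enum, and the `classify` helper are illustrative assumptions, not the benchmark's actual implementation.

```python
from enum import Enum

class Quadrant(Enum):
    """Illustrative reading of MediEval's four-quadrant framework.

    Axes (assumed, not taken from the paper):
      grounded   -- statement agrees with the UMLS-based knowledge base
      consistent -- statement agrees with the patient's EHR context
    """
    GROUNDED_CONSISTENT = (True, True)        # factual and context-supported
    GROUNDED_INCONSISTENT = (True, False)     # factual but context-contradicted
    UNGROUNDED_CONSISTENT = (False, True)     # counterfactual yet context-plausible
    UNGROUNDED_INCONSISTENT = (False, False)  # counterfactual and context-contradicted

def classify(grounded: bool, consistent: bool) -> Quadrant:
    # Enum lookup by value maps the two booleans to a quadrant.
    return Quadrant((grounded, consistent))
```

Under this reading, hallucinated support would correspond to a model endorsing an ungrounded statement, and truth inversion to a model rejecting a grounded one; both interpretations are inferred from the abstract alone.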
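A minimal sketch of how an asymmetric penalty might be attached to the standard DPO objective follows. The function name `corfu_dpo_loss`, the `unsafe_mask` input, and the up-weighting factor `lam` are assumptions for illustration; the abstract does not specify the loss.

```python
import torch
import torch.nn.functional as F

def corfu_dpo_loss(policy_chosen_logps: torch.Tensor,
                   policy_rejected_logps: torch.Tensor,
                   ref_chosen_logps: torch.Tensor,
                   ref_rejected_logps: torch.Tensor,
                   unsafe_mask: torch.Tensor,
                   beta: float = 0.1,
                   lam: float = 2.0) -> torch.Tensor:
    """DPO loss with an asymmetric penalty (hypothetical sketch).

    unsafe_mask (bool tensor) marks preference pairs whose rejected
    response is an unsafe confusion (e.g., a truth inversion); those
    pairs are up-weighted by lam > 1, pushing the policy harder away
    from them than from ordinary mistakes.
    """
    # Implicit DPO rewards: log-ratio of policy to reference model.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    margin = chosen_rewards - rejected_rewards

    # Asymmetric weighting: lam on unsafe confusions, 1 elsewhere.
    weights = torch.where(unsafe_mask,
                          torch.full_like(margin, lam),
                          torch.ones_like(margin))

    # Standard DPO objective is -log(sigmoid(margin)); apply weights.
    return (weights * -F.logsigmoid(margin)).mean()
```

With lam set to 1 this reduces to vanilla DPO, so the asymmetry can be ablated directly; how CoRFu actually parameterizes the penalty is not described in the abstract.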