The successes of foundation models such as ChatGPT and AlphaFold have spurred significant interest in building similar models for electronic medical records (EMRs) to improve patient care and hospital operations. However, recent hype has obscured critical gaps in our understanding of these models' capabilities. We review over 80 foundation models trained on non-imaging EMR data (i.e., clinical text and/or structured data) and create a taxonomy delineating their architectures, training data, and potential use cases. We find that most models are trained on small, narrowly scoped clinical datasets (e.g., MIMIC-III) or broad, public biomedical corpora (e.g., PubMed) and are evaluated on tasks that provide little meaningful insight into their usefulness to health systems. In light of these findings, we propose an improved evaluation framework for measuring the benefits of clinical foundation models, one more closely grounded in the metrics that matter in healthcare.