Hospitals and healthcare systems rely on operational decisions that determine patient flow, cost, and quality of care. Despite strong performance on medical knowledge and conversational benchmarks, foundation models trained on general text may lack the specialized knowledge required for these operational decisions. We introduce Lang1, a family of models (100M-7B parameters) pretrained on a specialized corpus blending 80B clinical tokens from NYU Langone Health's EHRs with 627B tokens from the internet. To rigorously evaluate Lang1 in real-world settings, we developed the REalistic Medical Evaluation (ReMedE) benchmark, derived from 668,331 EHR notes and covering five critical tasks: 30-day readmission prediction, 30-day mortality prediction, length-of-stay prediction, comorbidity coding, and insurance claim denial prediction. In zero-shot settings, both general-purpose and specialized models underperform on four of the five tasks (36.6%-71.7% AUROC), with mortality prediction being the exception. After finetuning, Lang1-1B outperforms finetuned generalist models up to 70x larger and zero-shot models up to 671x larger, improving AUROC by 3.64%-6.75% and 1.66%-23.66%, respectively. We also observed cross-task scaling: joint finetuning on multiple tasks improved performance on other tasks. Lang1-1B transfers effectively to out-of-distribution settings, including other clinical tasks and an external health system. Our findings suggest that predictive capabilities for hospital operations require explicit supervised finetuning, and that this finetuning is made more efficient by in-domain pretraining on EHRs. These findings support the emerging view that specialized LLMs can compete with generalist models on specialized tasks, and show that effective healthcare systems AI requires combining in-domain pretraining, supervised finetuning, and real-world evaluation beyond proxy benchmarks.
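For concreteness, below is a minimal sketch of how AUROC could be computed for one of the binary ReMedE-style tasks (e.g., 30-day readmission prediction). The scoring function and toy data here are hypothetical placeholders for illustration only, not the paper's actual evaluation pipeline.

```python
# Minimal sketch (not the paper's pipeline): computing AUROC for a binary
# clinical prediction task such as 30-day readmission.
# `score_note` is a hypothetical function mapping an EHR note to the model's
# predicted probability of the outcome.
from sklearn.metrics import roc_auc_score


def evaluate_auroc(notes, labels, score_note):
    """Return AUROC given clinical notes, binary outcome labels (0/1),
    and a scoring function that produces a probability per note."""
    scores = [score_note(note) for note in notes]
    return roc_auc_score(labels, scores)


if __name__ == "__main__":
    # Toy placeholders, not real EHR content.
    toy_notes = ["note A", "note B", "note C", "note D"]
    toy_labels = [0, 1, 0, 1]  # 1 = readmitted within 30 days
    toy_scorer = lambda note: 0.9 if note in ("note B", "note D") else 0.1
    print(f"AUROC: {evaluate_auroc(toy_notes, toy_labels, toy_scorer):.3f}")
```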