With COVID-19 now pervasive, identifying high-risk individuals is crucial. Using data from a major healthcare provider in Southwestern Pennsylvania, we develop survival models that predict progression to severe COVID-19. In doing so, we face a tradeoff between more accurate models that rely on many features and less accurate models that rely on a few features aligned with clinician intuition. Complicating matters, many electronic health record (EHR) features tend to be under-coded, which degrades the accuracy of smaller models. In this study, we develop two sets of high-performance risk scores: (i) an unconstrained model built from all available features; and (ii) a pipeline that learns a small set of clinical concepts before training a risk predictor. Learned concepts boost performance over the corresponding features (C-index 0.858 vs. 0.844) and improve on (i) when evaluated out-of-sample on subsequent time periods. Our models outperform previous work (C-index 0.844-0.872 vs. 0.598-0.810).
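For readers unfamiliar with the metric reported above, the following is a minimal sketch of how a concordance index (C-index) can be computed for right-censored survival data. It assumes Harrell's pairwise definition, which the abstract does not confirm as the variant used, and it skips pairs with tied observed times as a simplification. The function name `concordance_index` and the toy data are hypothetical and are not taken from the study.

```python
from itertools import combinations


def concordance_index(times, events, risk_scores):
    """Harrell-style C-index (illustrative sketch, not the study's code).

    times       : observed time for each subject (event or censoring time)
    events      : 1/True if the event was observed, 0/False if censored
    risk_scores : model output, where a higher score means higher predicted risk
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        # Order the pair so that `a` is the subject with the shorter observed time.
        a, b = (i, j) if times[i] < times[j] else (j, i)
        # Simplification: skip tied times. A pair is comparable only when the
        # subject with the shorter time actually experienced the event.
        if times[a] == times[b] or not events[a]:
            continue
        comparable += 1
        if risk_scores[a] > risk_scores[b]:
            concordant += 1.0  # earlier event received the higher risk score
        elif risk_scores[a] == risk_scores[b]:
            concordant += 0.5  # tied scores count as half-concordant
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical toy data: four subjects, the third one censored.
toy_times = [2, 4, 6, 8]
toy_events = [1, 1, 0, 1]
toy_risk = [0.9, 0.7, 0.8, 0.1]
toy_c_index = concordance_index(toy_times, toy_events, toy_risk)
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering, which gives a scale for reading the C-index ranges quoted in the abstract.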