A comparison of approaches to improve worst-case predictive model performance over patient subpopulations

Predictive models for clinical outcomes that are accurate on average in a patient population may underperform drastically for some subpopulations, potentially introducing or reinforcing inequities in care access and quality. Model training approaches that aim to maximize worst-case model performance across subpopulations, such as distributionally robust optimization (DRO), attempt to address this problem without introducing additional harms. We conduct a large-scale empirical study of DRO and several variations of standard learning procedures to identify approaches for model development and selection that consistently improve disaggregated and worst-case performance over subpopulations compared to standard approaches for learning predictive models from electronic health records data. In the course of our evaluation, we introduce an extension to DRO approaches that allows for specification of the metric used to assess worst-case performance. We conduct the analysis for models that predict in-hospital mortality, prolonged length of stay, and 30-day readmission for inpatient admissions, and predict in-hospital mortality using intensive care data. We find that, with relatively few exceptions, no approach performs better, for each patient subpopulation examined, than standard learning procedures using the entire training dataset. These results imply that when it is of interest to improve model performance for patient subpopulations beyond what can be achieved with standard practices, it may be necessary to do so via data collection techniques that increase the effective sample size or reduce the level of noise in the prediction problem.
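The notions of disaggregated and worst-case performance referred to above can be illustrated with a minimal sketch. It assumes binary outcomes, predicted probabilities, and a single subpopulation label per patient. The function names (`per_group_log_loss`, `worst_case_loss`, `update_group_weights`) are hypothetical, and the weight update is one common exponentiated-gradient formulation of group DRO, which may differ from the variants evaluated in the study.

```python
import numpy as np


def per_group_log_loss(y_true, y_prob, groups):
    """Disaggregated evaluation: mean binary cross-entropy per subpopulation."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip probabilities so the logarithms stay finite.
    y_prob = np.clip(np.asarray(y_prob, dtype=float), 1e-12, 1 - 1e-12)
    groups = np.asarray(groups)
    losses = -(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
    return {g.item(): float(losses[groups == g].mean()) for g in np.unique(groups)}


def worst_case_loss(group_losses):
    """Worst-case performance: the largest loss over all subpopulations."""
    return max(group_losses.values())


def update_group_weights(weights, group_losses, step_size=1.0):
    """Assumed group DRO step: up-weight subpopulations with higher loss.

    The returned weights sum to one; a training loop would minimize the
    weighted sum of group losses under these weights.
    """
    keys = sorted(group_losses)
    w = np.array([weights[k] for k in keys], dtype=float)
    w = w * np.exp(step_size * np.array([group_losses[k] for k in keys]))
    w = w / w.sum()
    return dict(zip(keys, w.tolist()))


# Toy data (illustrative only, not from the study).
y_true = [1, 0, 1, 0]
y_prob = [0.9, 0.1, 0.6, 0.4]
groups = ["a", "a", "b", "b"]

group_losses = per_group_log_loss(y_true, y_prob, groups)
worst = worst_case_loss(group_losses)
new_weights = update_group_weights({"a": 0.5, "b": 0.5}, group_losses)
```

In this toy example the model is less confident on group `b`, so that group determines the worst-case loss and receives the larger weight after the update. A standard (empirical risk minimization) procedure would instead keep the weights proportional to group size.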
