Machine learning models, now commonly developed to screen, diagnose, or predict health conditions, are evaluated with a variety of performance metrics. An important first step in assessing the practical utility of a model is to evaluate its average performance over an entire population of interest. In many settings, it is also critical that the model makes good predictions within predefined subpopulations. For instance, showing that a model is fair or equitable requires evaluating the model's performance in different demographic subgroups. However, subpopulation performance metrics are typically computed using only data from that subgroup, resulting in higher variance estimates for smaller groups. We devise a procedure to measure subpopulation performance that can be more sample-efficient than the typical subsample estimates. We propose using an evaluation model (a model that describes the conditional distribution of the predictive model score) to form model-based metric (MBM) estimates. Our procedure incorporates model checking and validation, and we propose a computationally efficient approximation of the traditional nonparametric bootstrap to form confidence intervals. We evaluate MBMs on two main tasks: a semi-synthetic setting where ground truth metrics are available and a real-world hospital readmission prediction task. We find that MBMs consistently produce more accurate and lower variance estimates of model performance for small subpopulations.
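To make the contrast concrete, the sketch below compares a plug-in (subsample) estimate of subgroup accuracy with a model-based estimate computed from a fitted evaluation model. This is a minimal illustration, not the paper's procedure: the Gaussian score model per class, the simulated subgroup data, and the 0.5 decision threshold are all assumptions introduced here for demonstration.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Hypothetical small subgroup (n = 30): predictive-model scores drawn
# conditionally on the true label. Illustrative data only.
n = 30
y = rng.binomial(1, 0.4, size=n)
scores = np.where(y == 1,
                  rng.normal(0.7, 0.15, size=n),
                  rng.normal(0.35, 0.15, size=n))
scores = np.clip(scores, 0.0, 1.0)

# Plug-in (subsample) estimate: accuracy at threshold 0.5, computed
# from the subgroup's raw data alone. High variance when n is small.
plugin_acc = np.mean((scores > 0.5) == (y == 1))

# Model-based estimate: fit a simple Gaussian "evaluation model" for
# the score distribution within each class, then compute the metric
# from the fitted model rather than from the raw empirical counts.
acc_terms = []
for label in (0, 1):
    s = scores[y == label]
    mu, sd = s.mean(), s.std(ddof=1)
    # P(correct decision | label) under the fitted score model.
    p_correct = 1 - norm.cdf(0.5, mu, sd) if label == 1 else norm.cdf(0.5, mu, sd)
    acc_terms.append(p_correct * (y == label).mean())
mbm_acc = sum(acc_terms)

print(f"plug-in accuracy:     {plugin_acc:.3f}")
print(f"model-based accuracy: {mbm_acc:.3f}")
```

The point of the sketch is the structure of the two estimators, not the numbers: the model-based version smooths over the subgroup's sampling noise by evaluating the metric against a fitted conditional score distribution, which is the role the abstract assigns to the evaluation model.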