This paper introduces the concept of resolving power to describe the capacity of an evaluation metric to discriminate between models of similar quality. This capacity depends on two attributes: (1) the metric's response to improvements in model quality (its signal), and (2) the metric's sampling variability (its noise). The paper defines resolving power as a metric's sampling uncertainty scaled by its signal. Resolving power's primary application is to compare the discriminating capacity of threshold-free evaluation metrics, such as the area under the receiver operating characteristic curve (AUROC) and the area under the precision-recall curve (AUPRC). A simulation study compares the AUROC and the AUPRC in a variety of contexts. The analysis suggests that the AUROC generally has greater resolving power, but that the AUPRC is superior in some conditions, such as those where high-quality models are applied to low-prevalence outcomes. The paper concludes by proposing an empirical method to estimate resolving power that can be applied to any dataset and any initial classification model.
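The abstract does not spell out the estimation procedure, but a minimal sketch of the signal-to-noise idea behind resolving power might look like the following. It assumes a binormal score model, takes the signal to be a metric's change under a fixed increment in class separation, and takes the noise to be the metric's bootstrap standard error at the baseline; the helpers simulate_scores, bootstrap_se, and resolving_power are hypothetical names for illustration, not the paper's method.

```python
# Illustrative sketch (assumed setup, not the paper's exact procedure):
# resolving power is computed here as signal / noise, where
#   signal = change in the metric for a fixed gain in model quality
#   noise  = bootstrap standard error of the metric at baseline quality
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

def simulate_scores(n, prevalence, separation, rng):
    """Binormal scores: negatives ~ N(0, 1), positives ~ N(separation, 1)."""
    y = rng.random(n) < prevalence
    scores = rng.normal(0.0, 1.0, n) + separation * y
    return y.astype(int), scores

def bootstrap_se(metric, y, scores, n_boot=500, rng=None):
    """Bootstrap standard error of a metric over resampled (label, score) pairs."""
    rng = rng or np.random.default_rng()
    n, stats = len(y), []
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)
        if y[idx].min() == y[idx].max():  # skip resamples with a single class
            continue
        stats.append(metric(y[idx], scores[idx]))
    return float(np.std(stats, ddof=1))

def resolving_power(metric, prevalence, sep_base=1.0, sep_step=0.1,
                    n=20_000, rng=None):
    """Signal-to-noise ratio of a metric for a small quality improvement."""
    rng = rng or np.random.default_rng()
    y0, s0 = simulate_scores(n, prevalence, sep_base, rng)
    y1, s1 = simulate_scores(n, prevalence, sep_base + sep_step, rng)
    signal = metric(y1, s1) - metric(y0, s0)       # response to quality gain
    noise = bootstrap_se(metric, y0, s0, rng=rng)  # sampling variability
    return signal / noise

rng = np.random.default_rng(0)
for prev in (0.5, 0.05):  # balanced vs. low-prevalence outcome
    rp_auroc = resolving_power(roc_auc_score, prev, rng=rng)
    rp_auprc = resolving_power(average_precision_score, prev, rng=rng)
    print(f"prevalence={prev:.2f}  AUROC RP={rp_auroc:.2f}  AUPRC RP={rp_auprc:.2f}")
```

In this toy setup the signal is itself estimated from a single simulated pair of datasets; a fuller study would average over many replicates and vary model quality and prevalence, as the paper's simulation does.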