Uncertainty in the predictions of probabilistic classifiers is a key concern when models are used to support human decision making, are embedded in broader probabilistic pipelines, or when sensitive automatic decisions must be taken. Studies have shown that most models are not intrinsically well calibrated, meaning that their decision scores are not consistent with posterior probabilities. Hence the ability to calibrate these models, or to enforce calibration while learning them, has regained interest in the recent literature. In this context, properly assessing calibration is paramount to quantifying new contributions that tackle it. However, commonly used metrics leave room for improvement, and the evaluation of calibration could benefit from deeper analyses. This paper therefore focuses on the empirical evaluation of calibration metrics in the context of classification. More specifically, it evaluates different estimators of the Expected Calibration Error ($ECE$), among which are legacy estimators and some novel ones proposed in this paper. We build an empirical procedure to quantify the quality of these $ECE$ estimators, and use it to decide which estimator should be used in practice in different settings.
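For context, the most widely used legacy estimator of $ECE$ bins the maximum predicted probabilities and compares per-bin accuracy to per-bin confidence. Below is a minimal sketch of that equal-width binned estimator, not the estimators proposed or evaluated in this paper; the function name binned_ece and the 15-bin default are illustrative assumptions.

import numpy as np

def binned_ece(confidences, correct, n_bins=15):
    # Sketch of the standard equal-width binned ECE estimator:
    # partition [0, 1] into n_bins bins of equal width, then sum the
    # |accuracy - confidence| gaps weighted by the fraction of samples per bin.
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bin_edges = np.linspace(0.0, 1.0, n_bins + 1)
    n = len(confidences)
    ece = 0.0
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        # Include the right edge only for the last bin so that confidence 1.0 is counted.
        if hi < 1.0:
            in_bin = (confidences >= lo) & (confidences < hi)
        else:
            in_bin = (confidences >= lo) & (confidences <= hi)
        if in_bin.sum() == 0:
            continue
        avg_conf = confidences[in_bin].mean()
        avg_acc = correct[in_bin].mean()
        ece += (in_bin.sum() / n) * abs(avg_acc - avg_conf)
    return ece

As a usage note, confidences would hold each sample's maximum predicted probability and correct a 0/1 indicator of whether the predicted class matched the label; the binning scheme (number of bins, equal-width versus equal-mass) is precisely the kind of design choice whose impact on $ECE$ estimation the paper's empirical procedure is meant to assess.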