The deployment of machine learning classifiers in high-stakes domains requires well-calibrated confidence scores for model predictions. In this paper we introduce the notion of variable-based calibration to characterize calibration properties of a model with respect to a variable of interest, generalizing traditional score-based calibration and metrics such as expected calibration error (ECE). In particular, we find that models with near-perfect ECE can exhibit significant variable-based calibration error as a function of features of the data. We demonstrate this phenomenon both theoretically and in practice on multiple well-known datasets, and show that it can persist after the application of existing recalibration methods. To mitigate this issue, we propose strategies for detection, visualization, and quantification of variable-based calibration error. We then examine the limitations of current score-based recalibration methods and explore potential modifications. Finally, we discuss the implications of these findings, emphasizing that an understanding of calibration beyond simple aggregate measures is crucial for endeavors such as fairness and model interpretability.
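To make the contrast between aggregate and variable-based calibration concrete, the following is a minimal illustrative sketch, not the paper's implementation: a standard binned ECE estimator alongside an analogous error computed by binning on a feature of interest. The function names, the quantile binning scheme, and the `variable_based_ce` helper are assumptions introduced here for illustration only.

```python
import numpy as np

def ece(confidences, correct, n_bins=10):
    """Binned expected calibration error: bin predictions by confidence,
    then average |accuracy - mean confidence| weighted by bin size."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.digitize(confidences, edges[1:-1])  # bin index 0 .. n_bins-1
    total, n = 0.0, len(confidences)
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            total += (mask.sum() / n) * gap
    return total

def variable_based_ce(confidences, correct, variable, n_bins=10):
    """Analogous binned error, but binning on a feature of interest
    (e.g., age) instead of the confidence score, so miscalibration that
    varies with the feature is not averaged away."""
    edges = np.quantile(variable, np.linspace(0.0, 1.0, n_bins + 1))
    idx = np.digitize(variable, edges[1:-1])
    total, n = 0.0, len(variable)
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            total += (mask.sum() / n) * gap
    return total
```

Under this kind of estimator, a model can have small ECE because over- and under-confidence cancel across confidence bins, while the same per-sample errors line up systematically within feature bins and produce a large variable-based error.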