Building reliable machine learning systems requires that we correctly understand their level of confidence. Calibration measures how well a model's confidence reflects its accuracy, and most calibration research focuses on techniques to improve an empirical estimate of calibration error, ECE_bin. We introduce a simulation framework that lets us show empirically that ECE_bin can systematically underestimate or overestimate the true calibration error, depending on the nature of the model's miscalibration, the size of the evaluation data set, and the number of bins. Critically, we find that ECE_bin is more strongly biased for perfectly calibrated models. We propose a simple alternative calibration error metric, ECE_sweep, in which the number of bins is chosen to be as large as possible while preserving monotonicity in the calibration function. Evaluating our measure on distributions fit to neural network confidence scores on CIFAR-10, CIFAR-100, and ImageNet, we show that ECE_sweep is a less biased estimator of calibration error and should therefore be used by any researcher wishing to evaluate the calibration of models trained on similar datasets.
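To make the two estimators concrete, the following is a minimal sketch rather than the authors' implementation. It assumes top-label confidences, equal-mass bins, an absolute-value (L1) gap, and a sweep that stops at the first bin count whose per-bin accuracies are no longer non-decreasing. The abstract specifies none of these details, and the function names `ece_bin`, `bin_accuracies`, and `ece_sweep` are illustrative.

```python
import numpy as np


def ece_bin(conf, correct, n_bins):
    """Binned ECE with equal-mass bins (binning scheme is an assumption).

    conf    -- predicted confidence for each example, in [0, 1]
    correct -- 1 if the prediction was right, else 0
    """
    conf = np.asarray(conf, dtype=float)
    correct = np.asarray(correct, dtype=float)
    order = np.argsort(conf, kind="stable")
    total = 0.0
    for idx in np.array_split(order, n_bins):
        if idx.size == 0:
            continue
        # Weight each bin's |accuracy - mean confidence| gap by its share of the data.
        gap = abs(correct[idx].mean() - conf[idx].mean())
        total += idx.size / conf.size * gap
    return total


def bin_accuracies(conf, correct, n_bins):
    """Per-bin accuracy, with bins ordered by increasing confidence."""
    conf = np.asarray(conf, dtype=float)
    correct = np.asarray(correct, dtype=float)
    order = np.argsort(conf, kind="stable")
    return [correct[idx].mean() for idx in np.array_split(order, n_bins) if idx.size]


def ece_sweep(conf, correct):
    """Sweep the bin count upward while per-bin accuracy stays non-decreasing.

    Returns (calibration error estimate, number of bins used).
    Stopping at the first violation is an assumption of this sketch.
    """
    n = len(conf)
    best = 1
    for b in range(2, n + 1):
        if np.any(np.diff(bin_accuracies(conf, correct, b)) < 0):
            break
        best = b
    return ece_bin(conf, correct, best), best
```

Calling `ece_sweep(conf, correct)` on held-out confidences and correctness indicators returns both the estimate and the bin count it settled on, which can be compared with `ece_bin` at a fixed bin count such as 15.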