For an AI system to be reliable, the confidence it expresses in its decisions must match its accuracy. To assess the degree of match, examples are typically binned by confidence, and the per-bin mean confidence and accuracy are compared. Most calibration research focuses on techniques to reduce this empirical measure of calibration error, ECE_bin. We instead focus on assessing statistical bias in this empirical measure, and we identify better estimators. We propose a framework through which we can compute the bias of a particular estimator for an evaluation data set of a given size. The framework involves synthesizing model outputs that have the same statistics as common neural architectures on popular data sets. We find that binning-based estimators with bins of equal mass (number of instances) have lower bias than estimators with bins of equal width. Our results identify two reliable calibration-error estimators: the debiased estimator (Bröcker, 2012; Ferro and Fricker, 2012) and a method we propose, ECE_sweep, which uses equal-mass bins and chooses the number of bins to be as large as possible while preserving monotonicity in the calibration function. With these estimators, we observe improvements in the effectiveness of recalibration methods and in the detection of model miscalibration.
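To make the two estimators concrete, the following is a minimal NumPy sketch of ECE_bin (with either equal-width or equal-mass bins) and of the ECE_sweep rule described above: grow the number of equal-mass bins for as long as the per-bin mean accuracies remain monotonically non-decreasing, then report the binned ECE at the largest such bin count. The function names, the exact stopping rule, and the synthetic data are illustrative assumptions, not the authors' reference implementation.

```python
import numpy as np

def ece_bin(confidences, accuracies, num_bins=15, equal_mass=True):
    """Binned ECE estimate (a sketch, not the authors' code).
    equal_mass=True places bin edges at quantiles of the confidence
    distribution so bins hold roughly equal numbers of instances;
    equal_mass=False uses bins of equal width on [0, 1]."""
    conf = np.asarray(confidences, dtype=float)
    acc = np.asarray(accuracies, dtype=float)
    n = len(conf)
    if equal_mass:
        edges = np.quantile(conf, np.linspace(0.0, 1.0, num_bins + 1))
    else:
        edges = np.linspace(0.0, 1.0, num_bins + 1)
    # Assign each example to a bin index in [0, num_bins - 1].
    idx = np.clip(np.digitize(conf, edges[1:-1]), 0, num_bins - 1)
    ece = 0.0
    for b in range(num_bins):
        mask = idx == b
        if not mask.any():
            continue
        # Weighted gap between mean accuracy and mean confidence in the bin.
        gap = abs(acc[mask].mean() - conf[mask].mean())
        ece += (mask.sum() / n) * gap
    return ece

def ece_sweep(confidences, accuracies):
    """ECE_sweep sketch: increase the number of equal-mass bins while the
    per-bin mean accuracies stay monotone non-decreasing, then compute the
    binned ECE at the largest bin count that preserved monotonicity."""
    conf = np.asarray(confidences, dtype=float)
    acc = np.asarray(accuracies, dtype=float)
    order = np.argsort(conf)
    conf, acc = conf[order], acc[order]
    n = len(conf)
    best_b = 1
    for b in range(2, n + 1):
        # Equal-mass split of the confidence-sorted examples into b bins.
        bins = np.array_split(np.arange(n), b)
        bin_acc = np.array([acc[i].mean() for i in bins])
        if np.all(np.diff(bin_acc) >= 0):
            best_b = b
        else:
            break  # first violation of monotonicity; keep the previous b
    bins = np.array_split(np.arange(n), best_b)
    return float(sum(len(i) / n * abs(acc[i].mean() - conf[i].mean())
                     for i in bins))

# Hypothetical usage on synthetic, deliberately overconfident predictions.
rng = np.random.default_rng(0)
conf = rng.uniform(0.5, 1.0, size=5000)
correct = rng.random(5000) < conf ** 2  # accuracy runs below confidence
print(ece_bin(conf, correct, num_bins=15, equal_mass=False))  # equal-width
print(ece_bin(conf, correct, num_bins=15, equal_mass=True))   # equal-mass
print(ece_sweep(conf, correct))
```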