Every uncalibrated classifier has a corresponding true calibration map that calibrates its confidence. Deviations of this idealistic map from the identity map reveal miscalibration. Such calibration errors can be reduced with many post-hoc calibration methods which fit some family of calibration maps on a validation dataset. In contrast, evaluation of calibration with the expected calibration error (ECE) on the test set does not explicitly involve fitting. However, as we demonstrate, ECE can still be viewed as if fitting a family of functions on the test data. This motivates the fit-on-the-test view on evaluation: first, approximate a calibration map on the test data, and second, quantify its distance from the identity. Exploiting this view allows us to unlock missed opportunities: (1) use the plethora of post-hoc calibration methods for evaluating calibration; (2) tune the number of bins in ECE with cross-validation. Furthermore, we introduce: (3) benchmarking on pseudo-real data where the true calibration map can be estimated very precisely; and (4) novel calibration and evaluation methods using new calibration map families PL and PL3.
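To make the "ECE as fitting" claim concrete, here is a worked sketch using the standard equal-width binned estimator (notation is ours, not the paper's). With test predictions partitioned by confidence into bins $\mathcal{B}_1, \ldots, \mathcal{B}_B$,

$$\mathrm{ECE} = \sum_{i=1}^{B} \frac{|\mathcal{B}_i|}{n}\,\bigl|\operatorname{acc}(\mathcal{B}_i) - \operatorname{conf}(\mathcal{B}_i)\bigr|,$$

where $\operatorname{acc}(\mathcal{B}_i)$ is the accuracy and $\operatorname{conf}(\mathcal{B}_i)$ the mean confidence within bin $i$. Since $\operatorname{acc}(\mathcal{B}_i)$ is exactly the value that a histogram-binning calibration map fitted on the test set assigns to bin $i$, ECE can be read as the bin-weighted distance between that fitted map and the identity map, which is the fit-on-the-test view.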
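The two-step procedure can be sketched in a few lines of code. In the minimal, illustrative example below, scikit-learn's IsotonicRegression stands in for a calibration map family (the paper's PL and PL3 families are not reproduced here); the function name and the toy data are our own assumptions.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

def fit_on_the_test_error(conf, correct):
    """Two-step fit-on-the-test evaluation:
    1) approximate a calibration map on the test data,
    2) quantify its distance from the identity map."""
    # Step 1: isotonic regression stands in for any post-hoc
    # calibration map family (e.g. histogram binning, PL, PL3).
    cal_map = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
    cal_map.fit(conf, correct)
    # Step 2: mean absolute deviation of the fitted map from identity.
    return float(np.mean(np.abs(cal_map.predict(conf) - conf)))

# Toy example: an overconfident model whose true calibration map is p -> p**2.
rng = np.random.default_rng(0)
conf = rng.uniform(0.5, 1.0, size=5000)
correct = (rng.uniform(size=5000) < conf**2).astype(float)
print(fit_on_the_test_error(conf, correct))  # roughly the sample mean of |p**2 - p|
```

Swapping the isotonic map for an equal-width histogram-binning map would recover binned ECE, which is what lets the number of bins be tuned with cross-validation like any other model hyperparameter.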