Probabilistic classifiers output confidence scores along with their predictions, and these confidence scores should be calibrated, i.e., they should reflect the reliability of the prediction. Confidence scores that minimize standard metrics such as the expected calibration error (ECE) accurately measure the reliability on average across the entire population. However, it is in general impossible to measure the reliability of an individual prediction. In this work, we propose the local calibration error (LCE) to span the gap between average and individual reliability. For each individual prediction, the LCE measures the average reliability of a set of similar predictions, where similarity is quantified by a kernel function on a pretrained feature space and by a binning scheme over predicted model confidences. We show theoretically that the LCE can be estimated sample-efficiently from data, and empirically find that it reveals miscalibration modes that are more fine-grained than the ECE can detect. Our key result is LoRe, a novel local recalibration method that improves confidence scores for individual predictions and decreases the LCE. Experimentally, we show that our recalibration method produces more accurate confidence scores, which improves downstream fairness and decision making on classification tasks with both image and tabular data.
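To make the definition concrete, the following is a minimal sketch of how an LCE-style estimate for one prediction might be computed, assuming an RBF kernel over pretrained features and equal-width confidence bins; the function name, kernel choice, and bandwidth are illustrative assumptions, not the paper's exact estimator.

```python
import numpy as np

def local_calibration_error(feats, confs, correct, query_feat, query_conf,
                            bandwidth=1.0, n_bins=10):
    """Sketch of a kernel-weighted local calibration gap for one prediction.

    feats:   (n, d) pretrained feature vectors of held-out predictions
    confs:   (n,) predicted confidences in [0, 1]
    correct: (n,) 0/1 indicators of whether each prediction was right
    """
    # RBF kernel similarity in the pretrained feature space (assumed choice)
    d2 = np.sum((feats - query_feat) ** 2, axis=1)
    w = np.exp(-d2 / (2.0 * bandwidth ** 2))

    # Restrict to predictions whose confidence falls in the query's bin
    bins = np.clip(np.floor(confs * n_bins).astype(int), 0, n_bins - 1)
    qbin = min(int(query_conf * n_bins), n_bins - 1)
    w = w * (bins == qbin)

    if w.sum() == 0:
        return 0.0  # no comparable neighbors; convention is arbitrary here

    # Gap between weighted average confidence and weighted average accuracy
    avg_conf = float((w * confs).sum() / w.sum())
    avg_acc = float((w * correct).sum() / w.sum())
    return abs(avg_conf - avg_acc)
```

For example, a neighborhood where every prediction has confidence 0.9 but all predictions are correct yields a local gap of about 0.1, i.e., local underconfidence that a population-level ECE could average away against overconfident regions elsewhere.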