We study the problem of post-hoc calibration for multiclass classification, with an emphasis on histogram binning. Multiple works have focused on calibration with respect to the confidence of just the predicted class (or 'top-label'). We find that the popular notion of confidence calibration [Guo et al., 2017] is not sufficiently strong -- there exist predictors that are not calibrated in any meaningful way but are perfectly confidence calibrated. We propose a closely related (but subtly different) notion, top-label calibration, that accurately captures the intuition and simplicity of confidence calibration, but addresses its drawbacks. We formalize a histogram binning (HB) algorithm that reduces top-label multiclass calibration to the binary case, prove that it has clean theoretical guarantees without distributional assumptions, and perform a methodical study of its practical performance. Some prediction tasks require stricter notions of multiclass calibration such as class-wise or canonical calibration. We formalize appropriate HB algorithms corresponding to each of these goals. In experiments with deep neural nets, we find that our principled versions of HB are often better than temperature scaling, for both top-label and class-wise calibration. Code for this work will be made publicly available at https://github.com/aigen/df-posthoc-calibration.
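The reduction described in the abstract, from top-label multiclass calibration to binary histogram binning, can be sketched as follows. This is a minimal illustration, not the authors' exact procedure (see the linked repository for that): it assumes uniform-mass bins, fits one binary histogram-binning calibrator per predicted class, and all function names and the synthetic data are hypothetical.

```python
import numpy as np

def histogram_binning(scores, labels, n_bins=10):
    """Binary histogram binning: uniform-mass bins, bin-wise empirical label means."""
    edges = np.quantile(scores, np.linspace(0.0, 1.0, n_bins + 1))
    edges[0], edges[-1] = 0.0, 1.0  # cover the full probability range

    def to_bin(s):
        return np.clip(np.searchsorted(edges, s, side="right") - 1, 0, n_bins - 1)

    ids = to_bin(scores)
    # Recalibrated value for a bin = fraction of positives observed in that bin
    means = np.array([labels[ids == b].mean() if np.any(ids == b) else 0.5
                      for b in range(n_bins)])
    return lambda s: means[to_bin(s)]

def top_label_hb(probs, labels, n_bins=10):
    """Reduce top-label calibration to one binary HB problem per predicted class:
    for each class c, calibrate the top confidence against the event {label == c}
    on the points where c is the predicted class."""
    preds, conf = probs.argmax(axis=1), probs.max(axis=1)
    cals = {c: histogram_binning(conf[preds == c],
                                 (labels[preds == c] == c).astype(float), n_bins)
            for c in np.unique(preds)}

    def recalibrate(p):
        pr, cf = p.argmax(axis=1), p.max(axis=1)
        out = cf.copy()  # fall back to the raw confidence for unseen classes
        for c, cal in cals.items():
            m = pr == c
            out[m] = cal(cf[m])
        return pr, out  # predicted class, recalibrated top-label confidence

    return recalibrate

# Tiny synthetic demo (hypothetical 3-class data)
rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(3), size=2000)
labels = np.array([rng.choice(3, p=p) for p in probs])
recal = top_label_hb(probs, labels, n_bins=5)
pred, conf = recal(probs)
```

In-sample, each bin's recalibrated confidence equals the empirical accuracy within that bin, which is the source of the distribution-free guarantees the paper proves for the (more careful) version of this scheme.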