A multiclass classifier is said to be top-label calibrated if the reported probability for the predicted class -- the top-label -- is calibrated, conditioned on the top-label. This conditioning on the top-label is absent in the closely related and popular notion of confidence calibration, which we argue makes confidence calibration difficult to interpret for decision-making. We propose top-label calibration as a rectification of confidence calibration. Further, we outline a multiclass-to-binary (M2B) reduction framework that unifies confidence, top-label, and class-wise calibration, among others. As its name suggests, M2B works by reducing multiclass calibration to numerous binary calibration problems, each of which can be solved using simple binary calibration routines. We instantiate the M2B framework with the well-studied histogram binning (HB) binary calibrator, and prove that the overall procedure is multiclass calibrated without making any assumptions on the underlying data distribution. In an empirical evaluation with four deep net architectures on CIFAR-10 and CIFAR-100, we find that the M2B + HB procedure achieves lower top-label and class-wise calibration error than other approaches such as temperature scaling. Code for this work is available at \url{https://github.com/aigen/df-posthoc-calibration}.
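To make the reduction concrete, the following is a minimal sketch of how top-label calibration could be reduced to one binary histogram binning problem per predicted class. It is an illustration under stated assumptions, not the authors' released implementation: the function names (`fit_binary_hb`, `fit_top_label_hb`, `predict_top_label_hb`), the default of 10 bins, and the use of plain empirical quantiles for uniform-mass bins (with no tie-breaking randomization) are all hypothetical choices made for this example.

```python
import numpy as np

def fit_binary_hb(scores, labels, n_bins=10):
    """Binary histogram binning with (approximately) uniform-mass bins.

    Assumption: bin edges are plain empirical quantiles of the scores.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=float)
    # Interior bin edges at the empirical quantiles of the scores.
    edges = np.quantile(scores, np.linspace(0, 1, n_bins + 1)[1:-1])
    bin_ids = np.searchsorted(edges, scores, side="right")
    # Empty bins fall back to the overall label frequency.
    means = np.full(n_bins, labels.mean())
    for b in range(n_bins):
        in_bin = bin_ids == b
        if in_bin.any():
            means[b] = labels[in_bin].mean()
    return edges, means

def fit_top_label_hb(probs, y, n_bins=10):
    """Fit one binary calibrator per predicted (top-label) class."""
    probs, y = np.asarray(probs, dtype=float), np.asarray(y)
    top = probs.argmax(axis=1)   # predicted class
    conf = probs.max(axis=1)     # reported probability for that class
    calibrators = {}
    for l in np.unique(top):
        sel = top == l
        # Binary problem for class l: was the prediction "l" correct?
        calibrators[int(l)] = fit_binary_hb(conf[sel], y[sel] == l, n_bins)
    return calibrators

def predict_top_label_hb(calibrators, probs):
    """Return predicted classes and their recalibrated confidences."""
    probs = np.asarray(probs, dtype=float)
    top = probs.argmax(axis=1)
    conf = probs.max(axis=1).copy()
    for l, (edges, means) in calibrators.items():
        sel = top == l
        conf[sel] = means[np.searchsorted(edges, conf[sel], side="right")]
    # Classes never predicted on the calibration set keep their raw confidence.
    return top, conf

# Toy calibration set: six points, two classes (illustrative values only).
probs = np.array([[0.9, 0.1], [0.8, 0.2], [0.7, 0.3],
                  [0.6, 0.4], [0.2, 0.8], [0.3, 0.7]])
y = np.array([0, 0, 1, 1, 1, 0])
calibrators = fit_top_label_hb(probs, y, n_bins=2)
top, conf = predict_top_label_hb(calibrators, probs)
```

The key design point is that the binning is done separately for each predicted class, so the recalibrated confidence is an empirical accuracy among points that share both the predicted class and the score bin. Pooling all predicted classes into a single binary problem would instead target confidence calibration.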