This paper investigates the post-hoc calibration of confidence for "exploratory" machine learning classification problems. The difficulty in these problems stems from the continuing desire, when curating datasets, to push the boundaries of which categories have enough examples to generalize from, and from confusion regarding the validity of those categories. We argue that for such problems the "one-versus-all" approach (top-label calibration) must be used rather than the "calibrate-the-full-response-matrix" approach advocated elsewhere in the literature. We introduce and test four new algorithms designed to handle the idiosyncrasies of category-specific confidence estimation. Chief among these methods is the use of kernel density ratios for confidence calibration, including a novel, bulletproof algorithm for choosing the bandwidth. We test our claims and explore the limits of calibration on a bioinformatics application (PhANNs) as well as the classic MNIST benchmark. Finally, our analysis argues that post-hoc calibration should always be performed, should be based only on the test dataset, and should be sanity-checked visually.
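As a concrete illustration of the kernel-density-ratio idea, the minimal sketch below calibrates top-label confidence by fitting class-conditional kernel density estimates of the classifier's top score for correct and incorrect predictions on a held-out set, then returning the posterior probability of correctness from their ratio. It is not the paper's algorithm: it uses a fixed Gaussian-KDE bandwidth rather than the bandwidth-selection procedure described above, and the function names (`fit_top_label_calibrator`, `calibrated_confidence`) are illustrative assumptions.

```python
# Sketch only: one-vs-all (top-label) confidence calibration via a kernel
# density ratio, with a fixed bandwidth (the paper's bandwidth-selection
# algorithm is not reproduced here).
import numpy as np
from scipy.stats import gaussian_kde


def fit_top_label_calibrator(top_scores, is_correct, bandwidth=0.1):
    """Fit KDEs of the top-label score for correct and incorrect predictions.

    top_scores : 1-D array of the classifier's top (winning) scores on a
                 held-out set; is_correct : matching boolean array saying
                 whether the top-label prediction was right.
    """
    top_scores = np.asarray(top_scores, dtype=float)
    is_correct = np.asarray(is_correct, dtype=bool)
    # Requires at least two correct and two incorrect examples with spread.
    kde_pos = gaussian_kde(top_scores[is_correct], bw_method=bandwidth)
    kde_neg = gaussian_kde(top_scores[~is_correct], bw_method=bandwidth)
    prior_pos = is_correct.mean()  # empirical P(prediction is correct)
    return kde_pos, kde_neg, prior_pos


def calibrated_confidence(score, kde_pos, kde_neg, prior_pos):
    """Posterior P(correct | score) from the class-conditional density ratio."""
    num = prior_pos * kde_pos(score)
    den = num + (1.0 - prior_pos) * kde_neg(score)
    return num / np.maximum(den, 1e-12)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy held-out scores: correct predictions tend to score higher.
    scores = np.concatenate([rng.beta(8, 2, 500), rng.beta(3, 3, 200)])
    correct = np.concatenate([np.ones(500, bool), np.zeros(200, bool)])
    kp, kn, pp = fit_top_label_calibrator(scores, correct)
    print(calibrated_confidence(np.array([0.5, 0.9, 0.99]), kp, kn, pp))
```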