This paper investigates post-hoc calibration of confidence for "exploratory" machine learning classification problems. The difficulty of these problems stems from the continuing desire, when curating datasets, to push the boundaries of which categories have enough examples to generalize from, and from confusion regarding the validity of those categories. We argue that for such problems the "one-versus-all" approach (top-label calibration) must be used rather than the "calibrate-the-full-response-matrix" approach advocated elsewhere in the literature. We introduce and test four new algorithms designed to handle the idiosyncrasies of category-specific confidence estimation. Chief among these methods is the use of kernel density ratios for confidence calibration, including a novel, bulletproof algorithm for choosing the bandwidth. We test our claims and explore the limits of calibration on a bioinformatics application (PhANNs) [1] as well as the classic MNIST benchmark [2]. Finally, our analysis argues that post-hoc calibration should always be performed, should be based only on the test dataset, and should be sanity-checked visually.
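To make the one-versus-all (top-label) idea concrete, the sketch below shows a minimal kernel-density-ratio calibrator in Python. It is an illustration of the general technique only, not the paper's specific algorithms: the function name `top_label_kde_calibrator` is hypothetical, and the bandwidth defaults to SciPy's built-in rule rather than the bandwidth-selection algorithm introduced in the paper.

```python
import numpy as np
from scipy.stats import gaussian_kde

def top_label_kde_calibrator(scores, correct, bandwidth=None):
    """Fit a one-vs-all (top-label) calibrator from held-out data.

    scores  : raw top-label scores (e.g. max softmax) on a calibration set
    correct : boolean array, True where the top label was actually right
    Returns a function mapping a raw top-label score to an estimate of
    P(correct | score), computed as a ratio of kernel density estimates.
    Hypothetical sketch; bandwidth handling here is just SciPy's default.
    """
    scores = np.asarray(scores, dtype=float)
    correct = np.asarray(correct, dtype=bool)
    p_correct = correct.mean()

    # Class-conditional densities of the top-label score.
    kde_pos = gaussian_kde(scores[correct], bw_method=bandwidth)
    kde_neg = gaussian_kde(scores[~correct], bw_method=bandwidth)

    def calibrate(s):
        s = np.atleast_1d(np.asarray(s, dtype=float))
        num = kde_pos(s) * p_correct
        den = num + kde_neg(s) * (1.0 - p_correct)
        return num / den  # calibrated confidence in [0, 1]

    return calibrate
```

In this toy form, `calibrate(0.9)` returns the estimated probability that a prediction with raw top-label score 0.9 is correct, given the held-out scores used to fit the two densities.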