Underlying the use of statistical approaches for a wide range of applications is the assumption that the probabilities obtained from a statistical model are representative of the "true" probability that an event, or outcome, will occur. Unfortunately, this is not the case for modern deep neural networks, which are often observed to be poorly calibrated. Additionally, these deep learning approaches make use of large numbers of model parameters, motivating Bayesian, or ensemble approximation, approaches to handle issues with parameter estimation. This paper explores the application of calibration schemes to deep ensembles, both from a theoretical perspective and empirically on a standard image classification task, CIFAR-100. The underlying theoretical requirements for calibration, and the associated calibration criteria, are first described. It is shown that well-calibrated ensemble members will not necessarily yield a well-calibrated ensemble prediction, and that if the ensemble prediction is well calibrated, its performance cannot exceed the average performance of the calibrated ensemble members. On CIFAR-100, the impact of calibration on ensemble prediction, and the associated calibration criteria, is evaluated. Additionally, the situation where multiple different topologies are combined is discussed.
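The claim that calibrated members need not yield a calibrated ensemble can be illustrated with a toy construction (this example is illustrative only, not taken from the paper): four equally likely inputs with deterministic binary labels, and two members whose predicted probabilities each match the empirical label rate within every confidence bin, yet whose average does not.

```python
import numpy as np

# Four equally likely inputs x1..x4 with deterministic binary labels.
labels = np.array([1, 1, 0, 0])

# Member A is confident and correct: within each confidence level,
# the predicted probability equals the empirical positive rate.
member_a = np.array([1.0, 1.0, 0.0, 0.0])

# Member B hedges on x2 and x3 at 0.5; the 0.5 bin contains one
# positive and one negative, so B is also perfectly calibrated.
member_b = np.array([1.0, 0.5, 0.5, 0.0])

def calibration_gaps(probs, labels):
    """For each distinct predicted probability, return the absolute gap
    between that probability and the empirical positive rate among the
    inputs receiving it (zero everywhere means perfectly calibrated)."""
    return {p: abs(p - labels[probs == p].mean()) for p in np.unique(probs)}

print(calibration_gaps(member_a, labels))  # all gaps 0: calibrated
print(calibration_gaps(member_b, labels))  # all gaps 0: calibrated

# The equal-weight ensemble average is [1.0, 0.75, 0.25, 0.0]:
# the 0.75 bin holds only a positive (rate 1) and the 0.25 bin only a
# negative (rate 0), so the ensemble is miscalibrated by 0.25 in each.
ensemble = (member_a + member_b) / 2
print(calibration_gaps(ensemble, labels))
```

The same effect persists under binning-based metrics such as expected calibration error, which is why the paper argues that calibration must be addressed at the ensemble level rather than only per member.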