Multi-class classification methods that produce sets of probabilistic classifiers, such as ensemble learning methods, are able to model aleatoric and epistemic uncertainty. Aleatoric uncertainty is then typically quantified via the Bayes error, and epistemic uncertainty via the size of the set. In this paper, we extend the notion of calibration, which is commonly used to evaluate the validity of the aleatoric uncertainty representation of a single probabilistic classifier, to assess the validity of an epistemic uncertainty representation obtained by sets of probabilistic classifiers. Broadly speaking, we call a set of probabilistic classifiers calibrated if one can find a calibrated convex combination of these classifiers. To evaluate this notion of calibration, we propose a novel nonparametric calibration test that generalizes an existing test for single probabilistic classifiers to the case of sets of probabilistic classifiers. Making use of this test, we empirically show that ensembles of deep neural networks are often not well calibrated.
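The central definition in the abstract can be made concrete with a short formalization. The notation below (classifiers $h_m$, label set $\{1,\dots,K\}$, probability simplices $\Delta_K$ and $\Delta_M$, weights $\lambda$) is our own sketch and is not taken from the paper itself; the single-classifier condition is the standard notion of calibration, and the set-level condition restates the abstract's "calibrated convex combination" criterion.

```latex
% Hedged sketch; symbols (h_m, \Delta_K, \Delta_M, \lambda) are our notation, not the paper's.
% A single probabilistic classifier h : \mathcal{X} \to \Delta_K is calibrated if
%   P(Y = y \mid h(X) = p) = p_y   for all labels y and (almost) all p.
% A set \{h_1, \dots, h_M\} is then called calibrated if some convex combination
% of its members is calibrated in the above sense:
\[
  \exists\, \lambda \in \Delta_M \;\text{such that}\quad
  \mathbb{P}\!\left(Y = y \,\middle|\, \textstyle\sum_{m=1}^{M} \lambda_m h_m(X) = p\right) = p_y
  \quad \text{for all } y \text{ and almost all } p .
\]
```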