Good decision making requires machine-learning models to provide trustworthy confidence scores. To this end, recent work has focused on miscalibration, i.e., the over- or under-confidence of model scores. Yet, contrary to widespread belief, calibration is not enough: even a classifier with the best possible accuracy and perfect calibration can have confidence scores far from the true posterior probabilities. This is due to the grouping loss, created by samples with the same confidence scores but different true posterior probabilities. Proper scoring rule theory shows that, given the calibration loss, the missing piece to characterize individual errors is the grouping loss. While there are many estimators of the calibration loss, none exists for the grouping loss in standard settings. Here, we propose an estimator to approximate the grouping loss. We use it to study modern neural network architectures in vision and NLP. We find that the grouping loss varies markedly across architectures, and that it is a key model-comparison factor across the most accurate, calibrated models. We also show that distribution shifts lead to high grouping loss.
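Concretely, for a proper scoring rule with associated divergence $d$, the decomposition alluded to above can be sketched as follows (following the terminology common to this literature, e.g. Kull & Flach; the symbols $S$, $C$, and $Q$ are illustrative choices, not necessarily the paper's notation). Writing $S$ for the model's confidence score, $C = \mathbb{E}[Y \mid S]$ for the calibrated score, and $Q = \mathbb{E}[Y \mid X]$ for the true posterior probability,
\[
\underbrace{\mathbb{E}\!\left[d(S, Y)\right]}_{\text{total loss}}
\;=\;
\underbrace{\mathbb{E}\!\left[d(S, C)\right]}_{\text{calibration loss}}
\;+\;
\underbrace{\mathbb{E}\!\left[d(C, Q)\right]}_{\text{grouping loss}}
\;+\;
\underbrace{\mathbb{E}\!\left[d(Q, Y)\right]}_{\text{irreducible loss}} .
\]
Perfect calibration only makes the first term vanish; the grouping-loss term remains whenever samples sharing the same score $S$ have different posteriors $Q$.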