When deployed for risk-sensitive tasks, deep neural networks must include an uncertainty estimation mechanism. Here we examine the relationship between deep architectures, together with their training regimes, and their selective prediction and uncertainty estimation performance. We consider some of the most popular previously proposed performance metrics for uncertainty estimation, including AUROC, ECE, and AURC, as well as coverage under a selective accuracy constraint. We present a novel and comprehensive study of the selective prediction and uncertainty estimation performance of 523 existing pretrained deep ImageNet classifiers that are available in popular repositories. We identify numerous previously unknown factors that affect uncertainty estimation and examine the relationships between the different metrics. We find that distillation-based training regimes consistently yield better uncertainty estimates than other training schemes such as vanilla training, pretraining on a larger dataset, and adversarial training. Moreover, we find a subset of ViT models that outperforms all other models in terms of uncertainty estimation performance. For example, we discovered an unprecedented 99% top-1 selective accuracy on ImageNet at 47% coverage (and 95% top-1 selective accuracy at 80% coverage) for a ViT model, whereas a competing EfficientNet-V2-XL cannot meet these accuracy constraints at any level of coverage. Our companion paper, also published at ICLR 2023 (A framework for benchmarking class-out-of-distribution detection and its application to ImageNet), examines the performance of these classifiers in a class-out-of-distribution setting.
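To make the selective prediction metrics above concrete, the following is a minimal sketch, not the authors' evaluation code. It assumes each prediction comes with a scalar confidence score (for instance the maximum softmax probability, which is an assumption here) and a flag saying whether the top-1 prediction was correct. The function names `selective_accuracies`, `selective_accuracy`, `coverage_for_accuracy`, and `aurc` are hypothetical, and the AURC shown is one common empirical form: the mean selective risk over all coverage levels.

```python
from typing import List, Sequence


def selective_accuracies(confidences: Sequence[float],
                         correct: Sequence[bool]) -> List[float]:
    """Accuracy on the k most confident predictions, for k = 1..n.

    Ties in confidence keep their input order (Python's sort is stable).
    """
    order = sorted(range(len(confidences)),
                   key=lambda i: confidences[i], reverse=True)
    accuracies, hits = [], 0
    for k, i in enumerate(order, start=1):
        hits += int(correct[i])
        accuracies.append(hits / k)
    return accuracies


def selective_accuracy(confidences: Sequence[float],
                       correct: Sequence[bool],
                       coverage: float) -> float:
    """Accuracy when only the most confident `coverage` fraction is answered."""
    accs = selective_accuracies(confidences, correct)
    k = max(1, int(coverage * len(accs)))  # number of predictions kept
    return accs[k - 1]


def coverage_for_accuracy(confidences: Sequence[float],
                          correct: Sequence[bool],
                          target: float) -> float:
    """Largest coverage whose selective accuracy is at least `target`.

    Returns 0.0 if no coverage level meets the constraint.
    """
    accs = selective_accuracies(confidences, correct)
    best_k = 0
    for k, acc in enumerate(accs, start=1):
        if acc >= target:
            best_k = k
    return best_k / len(accs)


def aurc(confidences: Sequence[float], correct: Sequence[bool]) -> float:
    """Empirical area under the risk-coverage curve (lower is better)."""
    accs = selective_accuracies(confidences, correct)
    return sum(1.0 - a for a in accs) / len(accs)


if __name__ == "__main__":
    # Toy data, made up for illustration only.
    conf = [0.99, 0.95, 0.90, 0.80, 0.70, 0.60, 0.55, 0.50]
    hit = [True, True, True, False, True, True, False, False]
    print(selective_accuracy(conf, hit, coverage=0.5))
    print(coverage_for_accuracy(conf, hit, target=0.8))
    print(aurc(conf, hit))
```

Under these definitions, a claim such as "99% selective accuracy at 47% coverage" corresponds to `coverage_for_accuracy(..., target=0.99)` returning 0.47, and a model that "cannot meet the constraint at any level of coverage" corresponds to a return value of 0.0.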