Generalization error predictors (GEPs) aim to predict model performance on unseen distributions by deriving dataset-level error estimates from sample-level scores. However, GEPs often rely on disparate mechanisms (e.g., regressors, thresholding functions, calibration datasets) to derive such error estimates, which can obscure the benefits of a particular scoring function. Therefore, in this work, we rigorously study the effectiveness of popular scoring functions (confidence, local manifold smoothness, model agreement), independent of mechanism choice. We find that, absent complex mechanisms, state-of-the-art confidence- and smoothness-based scores fail to outperform simple model-agreement scores when estimating error under distribution shifts and corruptions. Furthermore, in realistic settings where the training data has been compromised (e.g., label noise, measurement noise, undersampling), we find that model-agreement scores continue to perform well and that ensemble diversity is important for improving their performance. Finally, to better understand the limitations of scoring functions, we demonstrate that simplicity bias, or the propensity of deep neural networks to rely upon simple but brittle features, can adversely affect GEP performance. Overall, our work carefully studies the effectiveness of popular scoring functions in realistic settings and helps clarify their limitations.
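As a rough illustration of how a dataset-level error estimate can be derived from sample-level model-agreement scores without a regressor, threshold, or calibration set, consider the minimal sketch below. It is not the paper's implementation: the function names `agreement_scores` and `estimate_error`, the toy predictions, and the choice of plain averaging as the aggregation step are all assumptions made for illustration.

```python
from statistics import mean


def agreement_scores(base_preds, ensemble_preds):
    """Per-sample score: fraction of ensemble members whose predicted
    label matches the base model's prediction (hypothetical helper)."""
    return [
        mean(1.0 if member[i] == label else 0.0 for member in ensemble_preds)
        for i, label in enumerate(base_preds)
    ]


def estimate_error(scores):
    """Dataset-level estimate: average disagreement across samples.
    Plain averaging is an assumed, mechanism-free aggregation."""
    return 1.0 - mean(scores)


# Toy predicted labels on four unlabeled target samples (made-up values).
base = [0, 1, 2, 1]
ensemble = [
    [0, 1, 2, 0],
    [0, 1, 1, 0],
    [0, 2, 2, 1],
]

scores = agreement_scores(base, ensemble)
estimated_error = estimate_error(scores)
print(scores, estimated_error)
```

No ground-truth labels are used anywhere in the sketch; the estimate depends only on how often independently trained models agree with the base model, which is why the diversity of the ensemble matters for the quality of the estimate.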