``Effective robustness'' measures the extra out-of-distribution (OOD) robustness beyond what can be predicted from the in-distribution (ID) performance. Existing effective robustness evaluations typically use a single test set such as ImageNet to evaluate ID accuracy. This becomes problematic when evaluating models trained on different data distributions, e.g., comparing models trained on ImageNet vs. zero-shot language-image pre-trained models trained on LAION. In this paper, we propose a new effective robustness evaluation metric to compare the effective robustness of models trained on different data distributions. To do this, we control for accuracy on multiple ID test sets that cover the training distributions of all the evaluated models. Our new evaluation metric provides a better estimate of effective robustness and explains the surprising effective robustness gains of zero-shot CLIP-like models that appear when only a single ID dataset is considered; these gains diminish under our evaluation.
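To make the metric concrete, the sketch below illustrates one plausible instantiation of multi-ID effective robustness, assuming the standard recipe of fitting a linear trend on the logit scale between ID and OOD accuracies of baseline models. The function and variable names (`effective_robustness`, `id_accs`, `baseline_id_accs`, etc.) are illustrative placeholders, not the paper's code.

```python
# Minimal sketch: effective robustness controlled for multiple ID test sets.
# Assumption: OOD accuracy is predicted by a linear fit, on the logit scale,
# over several ID accuracies of baseline models; effective robustness is the
# gap between a model's observed OOD accuracy and that prediction.
import numpy as np

def logit(acc):
    # Map accuracies in (0, 1) to the logit scale, as is common in
    # effective-robustness analyses.
    acc = np.clip(acc, 1e-6, 1 - 1e-6)
    return np.log(acc / (1 - acc))

def effective_robustness(id_accs, ood_acc, baseline_id_accs, baseline_ood_accs):
    """Extra OOD accuracy beyond what the multi-ID linear fit predicts.

    id_accs:           (k,)   ID accuracies of the evaluated model on k ID test sets
    ood_acc:           float  OOD accuracy of the evaluated model
    baseline_id_accs:  (n, k) ID accuracies of n baseline models
    baseline_ood_accs: (n,)   OOD accuracies of the n baseline models
    """
    # Design matrix: logit-transformed ID accuracies plus an intercept column.
    X = np.column_stack([logit(baseline_id_accs), np.ones(len(baseline_ood_accs))])
    y = logit(baseline_ood_accs)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)      # fit the multi-ID trend
    pred_logit = np.append(logit(id_accs), 1.0) @ beta
    pred_acc = 1.0 / (1.0 + np.exp(-pred_logit))      # back to accuracy scale
    return ood_acc - pred_acc
```

With a single ID test set (k = 1) this reduces to the usual one-dimensional fit; adding further ID test sets that cover the training distributions of all evaluated models is what allows models trained on different data (e.g., ImageNet vs. LAION) to be compared on equal footing.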