Since out-of-distribution generalization is a generally ill-posed problem, various proxy targets (e.g., calibration, adversarial robustness, algorithmic corruptions, invariance across shifts) have been studied across different research programs, resulting in different recommendations. While sharing the same aspirational goal, these approaches have never been tested under the same experimental conditions on real data. In this paper, we take a unified view of previous work, highlighting message discrepancies that we address empirically, and providing recommendations on how to measure the robustness of a model and how to improve it. To this end, we collect 172 publicly available dataset pairs for training and out-of-distribution evaluation of accuracy, calibration error, adversarial attacks, environment invariance, and synthetic corruptions. We fine-tune over 31k networks from nine different architectures in both the many- and few-shot settings. Our findings confirm that in- and out-of-distribution accuracies tend to increase jointly, but show that their relation is largely dataset-dependent and, in general, more nuanced and complex than posited by previous, smaller-scale studies.
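To make the core measurement concrete, the following minimal sketch (not the paper's actual pipeline) shows the kind of quantity such a study aggregates: for each train/OOD dataset pair, record the in-distribution and out-of-distribution accuracy of a fine-tuned model, then inspect how the two relate across pairs. The helper `pearson_r`, the `PairResult` records, and the example numbers are all illustrative assumptions.

```python
# A minimal sketch of measuring the ID vs. OOD accuracy relationship
# across dataset pairs. All names and values here are hypothetical;
# in practice each entry would come from fine-tuning a network on the
# pair's training set and evaluating on both test sets.

from dataclasses import dataclass
import statistics

@dataclass
class PairResult:
    pair_name: str
    id_accuracy: float   # accuracy on the held-out in-distribution test split
    ood_accuracy: float  # accuracy on the paired out-of-distribution test set

def pearson_r(xs, ys):
    """Pearson correlation between ID and OOD accuracies across pairs."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

# Illustrative results; a pair like "pair_c" shows ID and OOD diverging,
# which is why the relation can be strongly dataset-dependent.
results = [
    PairResult("pair_a", 0.91, 0.74),
    PairResult("pair_b", 0.88, 0.80),
    PairResult("pair_c", 0.95, 0.52),
]

r = pearson_r([p.id_accuracy for p in results],
              [p.ood_accuracy for p in results])
print(f"ID vs. OOD accuracy correlation across pairs: r = {r:.2f}")
```

A per-pair view like this, rather than a single pooled correlation, is what lets one see that ID and OOD accuracy often rise together on some dataset pairs while decoupling on others.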