Several studies have empirically compared the in-distribution (ID) and out-of-distribution (OOD) performance of various models. They report frequent positive correlations on benchmarks in computer vision and NLP. Surprisingly, they never observe inverse correlations, which would suggest necessary trade-offs. This matters for determining whether ID performance can serve as a proxy for OOD generalization. This short paper shows that inverse correlations between ID and OOD performance do happen in real-world benchmarks. They may have been missed in past studies because of a biased selection of models. We show an example of the pattern on the WILDS-Camelyon17 dataset, using models from multiple training epochs and random seeds. Our observations are particularly striking on models trained with a regularizer that diversifies the solutions to the empirical risk minimization (ERM) objective. We nuance the recommendations and conclusions made in past studies. (1) High OOD performance does sometimes require trading off ID performance. (2) Focusing on ID performance alone may not lead to optimal OOD performance: it can lead to diminishing and eventually negative returns in OOD performance. (3) Our example is a reminder that empirical studies only chart regimes achievable with existing methods: care is warranted in deriving prescriptive recommendations.
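To make the selection effect concrete, the following minimal sketch computes the Pearson correlation between ID and OOD accuracy over a set of models, first over all of them and then over two subsets. The accuracy values are synthetic and chosen purely for illustration; they are not results from WILDS-Camelyon17 or from any study. The helper `pearson` and the variable names are likewise assumptions introduced here.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# SYNTHETIC, illustrative accuracies (not measured results).
# Models are sorted by ID accuracy; OOD accuracy first rises, then falls,
# mimicking "diminishing and eventually negative returns".
id_acc = [0.80, 0.84, 0.88, 0.90, 0.92, 0.94, 0.95, 0.96]
ood_acc = [0.60, 0.66, 0.72, 0.75, 0.74, 0.70, 0.66, 0.62]

# Correlation over every model, and over two different selections of models.
r_all = pearson(id_acc, ood_acc)
r_low_id = pearson(id_acc[:4], ood_acc[:4])    # only the lower-ID models
r_high_id = pearson(id_acc[4:], ood_acc[4:])   # only the higher-ID models

if __name__ == "__main__":
    print(f"all models:      r = {r_all:+.2f}")
    print(f"lower-ID subset: r = {r_low_id:+.2f}")
    print(f"higher-ID subset: r = {r_high_id:+.2f}")
```

With these made-up numbers, a study that only sampled the lower-ID models would see a positive correlation, while the higher-ID models exhibit an inverse one. This is only a toy picture of how the choice of models can determine which pattern is observed.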