Several studies have empirically compared the in-distribution (ID) and out-of-distribution (OOD) performance of various models. They report frequent positive correlations on computer vision and NLP benchmarks. Surprisingly, they never observe inverse correlations that would suggest necessary trade-offs. This question matters for determining whether ID performance can serve as a proxy for OOD generalization. This paper shows that inverse correlations between ID and OOD performance do occur in real-world benchmarks; they could have been missed in past studies because of a biased selection of models. We show an example on the WILDS-Camelyon17 dataset, using models from multiple training epochs and random seeds. Our observations are particularly striking for models trained with a regularizer that diversifies the solutions to the ERM objective. We nuance the recommendations and conclusions of past studies: (1) high OOD performance may sometimes require trading off ID performance; (2) focusing on ID performance alone may not lead to optimal OOD performance, since it can yield diminishing and eventually negative returns in OOD performance; (3) our example is a reminder that empirical studies only chart the regimes achievable with existing methods, so care is warranted in deriving prescriptive recommendations.
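The abstract mentions a regularizer that diversifies the solutions to the ERM objective. The abstract does not specify its form, so the sketch below is only an illustration of one common diversity-regularization pattern: train two prediction heads on the same ERM loss while penalizing agreement between their outputs, so that the heads are pushed toward distinct solutions. All function names and the choice of penalty here are assumptions for illustration, not the paper's actual method.

```python
import numpy as np

def erm_loss(p, y):
    # Standard binary cross-entropy (empirical risk) for predicted
    # probabilities p and binary labels y.
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def diversity_penalty(p1, p2):
    # Reward disagreement between the two heads: the penalty is zero
    # when the heads make identical predictions and becomes more
    # negative as their predictions diverge.
    return -np.mean((p1 - p2) ** 2)

def total_loss(p1, p2, y, lam=0.1):
    # Both heads minimize the ERM objective, plus a diversity term
    # (weighted by lam) that discourages them from collapsing onto
    # the same solution.
    return erm_loss(p1, y) + erm_loss(p2, y) + lam * diversity_penalty(p1, p2)

y = np.array([1.0, 0.0])
head_a = np.array([0.9, 0.1])   # one ERM solution
head_b = np.array([0.6, 0.4])   # a different, less confident solution
loss = total_loss(head_a, head_b, y)
```

The point of such a regularizer is to surface multiple near-optimal ERM solutions with different OOD behavior, which is what makes the ID/OOD inverse correlations in the paper's experiments visible.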