Machine learning models often perform poorly on subgroups that are underrepresented in the training data. Yet, little is understood about the mechanisms that cause subpopulation shifts, and how algorithms generalize across such diverse shifts at scale. In this work, we provide a fine-grained analysis of subpopulation shift. We first propose a unified framework that dissects and explains common shifts in subgroups. We then establish a comprehensive benchmark of 20 state-of-the-art algorithms evaluated on 12 real-world datasets in vision, language, and healthcare domains. With results obtained from training over 10,000 models, we reveal intriguing observations for future progress in this space. First, existing algorithms improve subgroup robustness only under certain types of shifts but not others. Moreover, while current algorithms rely on group-annotated validation data for model selection, we find that a simple selection criterion based on worst-class accuracy is surprisingly effective even without any group information. Finally, unlike existing works that solely aim to improve worst-group accuracy (WGA), we demonstrate the fundamental tradeoff between WGA and other important metrics, highlighting the need to carefully choose testing metrics. Code and data are available at: https://github.com/YyzHarry/SubpopBench.
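Below is a minimal sketch (not the benchmark's official code) illustrating the two quantities referenced above: worst-group accuracy, which requires group annotations, and worst-class accuracy, which needs only class labels and can therefore serve as a group-free model-selection criterion on validation data. The array names (preds, labels, groups) are illustrative assumptions.

```python
import numpy as np

def worst_group_accuracy(preds, labels, groups):
    """Minimum accuracy over annotated subgroups (requires group labels)."""
    accs = [np.mean(preds[groups == g] == labels[groups == g])
            for g in np.unique(groups)]
    return min(accs)

def worst_class_accuracy(preds, labels):
    """Minimum per-class accuracy; usable for model selection without group info."""
    accs = [np.mean(preds[labels == c] == labels[labels == c])
            for c in np.unique(labels)]
    return min(accs)

# Hypothetical usage: among saved checkpoints, pick the one whose predictions
# on a held-out validation set maximize worst_class_accuracy, which requires
# no subgroup annotations.
```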