Overparameterization has been shown to result in poor test accuracy on rare subgroups in a variety of settings where subgroup information is known. To gain a more complete picture, we consider the case where subgroup information is unknown. We investigate the effect of model size on worst-group generalization under empirical risk minimization (ERM) across a wide range of settings, varying: 1) architectures (ResNet, VGG, or BERT), 2) domains (vision or natural language processing), 3) model size (width or depth), and 4) initialization (with pre-trained or random weights). Our systematic evaluation reveals that increasing model size does not hurt, and may help, worst-group test performance under ERM across all setups. In particular, increasing pre-trained model size consistently improves performance on Waterbirds and MultiNLI. We advise practitioners to use larger pre-trained models when subgroup labels are unknown.
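The worst-group metric referenced above can be made concrete with a minimal sketch: accuracy is computed separately for each subgroup, and the minimum across subgroups is reported. The function name and inputs below are illustrative, not taken from the paper, and assume group labels are available at evaluation time (even if unknown during training).

```python
# Minimal sketch of worst-group accuracy: per-subgroup accuracy,
# then the minimum over subgroups. Illustrative only.
from collections import defaultdict

def worst_group_accuracy(preds, labels, groups):
    correct = defaultdict(int)
    total = defaultdict(int)
    for p, y, g in zip(preds, labels, groups):
        total[g] += 1
        correct[g] += int(p == y)
    # Report the accuracy of the worst-performing subgroup.
    return min(correct[g] / total[g] for g in total)

# Example: group 0 is perfectly classified, group 1 is half right,
# so worst-group accuracy is 0.5 even though average accuracy is 0.75.
print(worst_group_accuracy([1, 1, 0, 0], [1, 1, 0, 1], [0, 0, 1, 1]))  # → 0.5
```

This is the quantity that can degrade on rare subgroups even when average test accuracy stays high, which is why the paper tracks it rather than overall accuracy.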