Better May Not Be Fairer: Can Data Augmentation Mitigate Subgroup Degradation?

It is no secret that deep learning models exhibit undesirable behaviors such as learning spurious correlations instead of learning correct relationships between input/output pairs. Prior works on robustness study datasets that mix low-level features to quantify how spurious correlations affect predictions instead of considering natural semantic factors due to limitations in accessing realistic datasets for comprehensive evaluation. To bridge this gap, in this paper we first investigate how natural background colors play a role as spurious features in image classification tasks by manually splitting the test sets of CIFAR10 and CIFAR100 into subgroups based on the background color of each image. We name our datasets CIFAR10-B and CIFAR100-B. We find that while standard CNNs achieve human-level accuracy, the subgroup performances are not consistent, and the phenomenon remains even after data augmentation (DA). To alleviate this issue, we propose FlowAug, a semantic DA method that leverages the decoupled semantic representations captured by a pre-trained generative flow. Experimental results show that FlowAug achieves more consistent results across subgroups than other types of DA methods on CIFAR10 and CIFAR100. Additionally, it shows better generalization performance. Furthermore, we propose a generic metric for studying model robustness to spurious correlations, where we take a macro average on the weighted standard deviations across different classes. Per our metric, FlowAug demonstrates less reliance on spurious correlations. Although this metric is proposed to study our curated datasets, it applies to all datasets that have subgroups or subclasses. Lastly, aside from less dependence on spurious correlations and better generalization on in-distribution test sets, we also show superior out-of-distribution results on CIFAR10.1 and competitive performances on CIFAR10-C and CIFAR100-C.
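The robustness metric described in the abstract (a macro average, over classes, of weighted standard deviations of subgroup performance) can be illustrated with a minimal sketch. The abstract does not specify the weighting scheme, so this sketch assumes each subgroup is weighted by its sample count within its class and that the quantity being compared is per-subgroup accuracy. The function names `weighted_std` and `macro_weighted_std` and the toy numbers are hypothetical and only serve the illustration; they are not taken from the paper.

```python
import math


def weighted_std(values, weights):
    """Weighted (population) standard deviation of `values`.

    Assumption: weights are subgroup sample counts; the paper's exact
    weighting may differ.
    """
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    mean = sum(w * v for v, w in zip(values, weights)) / total
    variance = sum(w * (v - mean) ** 2 for v, w in zip(values, weights)) / total
    return math.sqrt(variance)


def macro_weighted_std(per_class):
    """Macro average of per-class weighted standard deviations.

    `per_class` maps a class label to a list of
    (subgroup_accuracy, subgroup_size) pairs, one pair per
    background-color subgroup of that class.
    """
    if not per_class:
        raise ValueError("need at least one class")
    stds = []
    for groups in per_class.values():
        accuracies = [acc for acc, _ in groups]
        sizes = [size for _, size in groups]
        stds.append(weighted_std(accuracies, sizes))
    # Macro average: every class counts equally, regardless of its size.
    return sum(stds) / len(stds)


# Toy, made-up numbers for illustration only (not results from the paper).
toy = {
    "class_a": [(0.90, 50), (0.90, 50)],  # consistent across subgroups
    "class_b": [(1.00, 50), (0.80, 50)],  # inconsistent across subgroups
}
print(macro_weighted_std(toy))
```

Under these assumptions, a lower value means that accuracy varies less across background-color subgroups within each class, which is the sense in which the abstract speaks of less reliance on spurious correlations.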
