While datasets with single-label supervision have propelled rapid advances in image classification, additional annotations are necessary in order to quantitatively assess how models make predictions. To this end, for a subset of ImageNet samples, we collect segmentation masks for the entire object and $18$ informative attributes. We call this dataset RIVAL10 (RIch Visual Attributes with Localization); it consists of roughly $26k$ instances over $10$ classes. Using RIVAL10, we evaluate the sensitivity of a broad set of models to noise corruptions in foregrounds, backgrounds, and attributes. In our analysis, we consider diverse state-of-the-art architectures (ResNets, Transformers) and training procedures (CLIP, SimCLR, DeiT, Adversarial Training). We find that, somewhat surprisingly, adversarial training makes ResNets more sensitive to backgrounds relative to foregrounds than standard training does. Similarly, contrastively trained models have lower relative foreground sensitivity, for both Transformers and ResNets. We also observe an intriguing adaptive ability of Transformers to increase their relative foreground sensitivity as the corruption level increases. Using saliency methods, we automatically discover spurious features that drive the background sensitivity of models and assess the alignment of saliency maps with foregrounds. Finally, we quantitatively study the attribution problem for neural features by comparing feature saliency with ground-truth localization of semantic attributes.
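To make the corruption protocol concrete, the sketch below illustrates one plausible way to restrict Gaussian noise to the foreground or background using the RIVAL10 segmentation masks and to summarize the resulting accuracy drops as a relative foreground sensitivity score. This is a minimal illustration assuming PyTorch tensors in $[0,1]$; the function names (`corrupt_region`, `relative_fg_sensitivity`) and the exact normalization of the metric are hypothetical and may differ from the paper's protocol.

```python
import torch


def corrupt_region(images, masks, sigma=0.25, region="foreground"):
    """Add Gaussian noise only inside (foreground) or outside (background)
    the binary object masks.

    images: (B, 3, H, W) in [0, 1]; masks: (B, 1, H, W) with 1 on the object.
    """
    noise = sigma * torch.randn_like(images)
    region_mask = masks if region == "foreground" else 1 - masks
    return (images + noise * region_mask).clamp(0, 1)


@torch.no_grad()
def accuracy(model, images, labels):
    # Top-1 accuracy of the classifier on a batch.
    preds = model(images).argmax(dim=1)
    return (preds == labels).float().mean().item()


@torch.no_grad()
def relative_fg_sensitivity(model, images, masks, labels, sigma=0.25):
    """One hedged notion of relative foreground sensitivity: the fraction of
    the total noise-induced accuracy drop attributable to the foreground."""
    clean_acc = accuracy(model, images, labels)
    fg_drop = clean_acc - accuracy(
        model, corrupt_region(images, masks, sigma, "foreground"), labels
    )
    bg_drop = clean_acc - accuracy(
        model, corrupt_region(images, masks, sigma, "background"), labels
    )
    return fg_drop / (fg_drop + bg_drop + 1e-8)
```

Sweeping `sigma` over a range of corruption levels and plotting the resulting scores is one way to reproduce the kind of foreground-versus-background comparison described above, e.g. contrasting an adversarially trained ResNet with its standardly trained counterpart.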