Recent state-of-the-art vision models introduce new architectures, learning paradigms, and larger pretraining data, leading to impressive performance on tasks such as classification. While previous generations of vision models were shown to lack robustness to factors such as pose, it is unclear to what extent this next generation of models is more robust. To study this question, we develop a dataset of more than 7 million images with controlled changes in pose, position, background, lighting, and size. We study not only how robust recent state-of-the-art models are, but also the extent to which models can generalize to variation in these factors when it is present during training. We consider a catalog of recent vision models, including vision transformers (ViT), self-supervised models such as masked autoencoders (MAE), and models trained on larger datasets such as CLIP. We find that, out of the box, even today's best models are not robust to common changes in pose, size, and background. When some samples are varied during training, we find that models require a substantial fraction of diverse samples before they generalize, though robustness does eventually improve. When diversity is seen only for some classes, however, models do not generalize to other classes unless those classes are very similar to the ones varied during training. We hope our work will shed further light on the blind spots of state-of-the-art models and spur the development of more robust vision models.