Deep learning vision systems are widely deployed across applications where reliability is critical. However, even today's best models can fail to recognize an object when its pose, lighting, or background varies. While existing benchmarks surface examples that are challenging for models, they do not explain why such mistakes arise. To address this need, we introduce ImageNet-X, a set of sixteen human annotations of factors such as pose, background, or lighting for the entire ImageNet-1k validation set as well as a random subset of 12k training images. Equipped with ImageNet-X, we investigate 2,200 current recognition models and study the types of mistakes as a function of a model's (1) architecture, e.g., transformer vs. convolutional, (2) learning paradigm, e.g., supervised vs. self-supervised, and (3) training procedures, e.g., data augmentation. Regardless of these choices, we find models have consistent failure modes across ImageNet-X categories. We also find that while data augmentation can improve robustness to certain factors, it induces spill-over effects on other factors. For example, strong random cropping hurts robustness on smaller objects. Together, these insights suggest that to advance the robustness of modern vision models, future research should focus on collecting additional data and understanding data augmentation schemes. Along with these insights, we release a toolkit based on ImageNet-X to spur further study into the mistakes image recognition systems make.
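To make the per-factor analysis concrete, here is a minimal sketch of how one might join ImageNet-X-style factor annotations with a model's predictions and compare error rates per factor against the overall error rate. The file names, column names, and factor list below are illustrative assumptions, not the released toolkit's API.

```python
# Hypothetical sketch: per-factor error analysis with ImageNet-X-style annotations.
# File names, column names, and the factor list are assumptions for illustration.
import pandas as pd

FACTORS = ["pose", "background", "lighting", "size", "occlusion"]  # illustrative subset

# One row per validation image: image id plus binary flags for each annotated factor.
annotations = pd.read_csv("imagenet_x_annotations.csv")   # hypothetical file
# One row per validation image: image id, ground-truth label, and model prediction.
predictions = pd.read_csv("model_predictions.csv")        # hypothetical file

df = annotations.merge(predictions, on="image_id")
df["error"] = (df["prediction"] != df["label"]).astype(float)

overall_error = df["error"].mean()
for factor in FACTORS:
    subset = df[df[factor] == 1]
    if len(subset) == 0:
        continue
    # A ratio above 1 means the model fails more often when this factor is present.
    ratio = subset["error"].mean() / overall_error
    print(f"{factor:12s} error ratio: {ratio:.2f}")
```

Comparing such error ratios across models with different architectures, learning paradigms, or augmentation recipes is the kind of analysis the abstract describes.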