Deep Convolutional Neural Networks (CNNs) have long been the architecture of choice for computer vision tasks. Recently, Transformer-based architectures such as the Vision Transformer (ViT) have matched or even surpassed ResNets for image classification. However, details of the Transformer architecture, such as the use of non-overlapping patches, lead one to wonder whether these networks are as robust as their convolutional counterparts. In this paper, we perform an extensive study of a variety of different measures of robustness of ViT models and compare the findings to ResNet baselines. We investigate robustness to input perturbations as well as robustness to model perturbations. We find that, when pre-trained with a sufficient amount of data, ViT models are at least as robust as their ResNet counterparts on a broad range of perturbations. We also find that Transformers are robust to the removal of almost any single layer, and that, while activations from later layers are highly correlated with each other, they nevertheless play an important role in classification.
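To make the layer-removal experiment concrete, the following is a minimal sketch of such a lesion study. It assumes a timm-style ViT whose encoder blocks live in a `model.blocks` ModuleList; the function name `layer_ablation_accuracies` and that attribute layout are illustrative assumptions, not details taken from the paper. Each block is replaced in turn with an identity map, which is well-defined because of the residual connections, and top-1 accuracy is re-measured:

```python
import copy

import torch


def layer_ablation_accuracies(model, loader, device="cpu"):
    """Top-1 accuracy after removing each Transformer encoder block in turn.

    Hypothetical sketch: assumes a timm-style ViT whose blocks live in
    ``model.blocks`` (a ``torch.nn.ModuleList``); adapt for other codebases.
    """
    results = {}
    for i in range(len(model.blocks)):
        ablated = copy.deepcopy(model)
        # Replace block i with an identity map; the residual stream keeps
        # shapes consistent, so the rest of the network runs unchanged.
        ablated.blocks[i] = torch.nn.Identity()
        ablated.eval().to(device)
        correct = total = 0
        with torch.no_grad():
            for images, labels in loader:
                preds = ablated(images.to(device)).argmax(dim=-1)
                correct += (preds == labels.to(device)).sum().item()
                total += labels.numel()
        results[i] = correct / total
    return results
```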
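The correlation between layer activations can be probed in a similarly compact way. The sketch below collects CLS-token activations from every block via forward hooks and compares them pairwise with linear CKA; the function `blockwise_cls_cka` and the choice of CKA as the similarity measure are assumptions for illustration, not necessarily the paper's exact methodology:

```python
import torch


def blockwise_cls_cka(model, images):
    """Pairwise linear CKA between CLS-token activations of all encoder blocks.

    Hypothetical sketch: assumes a timm-style ViT with a ``blocks``
    ModuleList and the CLS token at sequence index 0.
    """
    acts = []
    hooks = [blk.register_forward_hook(
                 lambda _mod, _inp, out: acts.append(out[:, 0].detach()))
             for blk in model.blocks]
    with torch.no_grad():
        model(images)
    for h in hooks:
        h.remove()

    def linear_cka(x, y):
        # Center per feature, then apply the standard linear-CKA formula.
        x = x - x.mean(0)
        y = y - y.mean(0)
        num = (x.T @ y).norm() ** 2
        return (num / ((x.T @ x).norm() * (y.T @ y).norm())).item()

    n = len(acts)
    return [[linear_cka(acts[i], acts[j]) for j in range(n)] for i in range(n)]
```

High off-diagonal values among the later blocks would reflect the reported redundancy between deep layers, even though ablating those layers still affects classification.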