Transformers, composed of multiple self-attention layers, hold strong promise as a generic learning primitive applicable to different data modalities, including the recent breakthroughs in computer vision that achieve state-of-the-art (SOTA) accuracy on standard benchmarks. What remains largely unexplored is their robustness evaluation and attribution. In this work, we study the robustness of the Vision Transformer (ViT) against common corruptions and perturbations, distribution shifts, and natural adversarial examples. We use six diverse ImageNet datasets concerning robust classification to conduct a comprehensive performance comparison of ViT models and a SOTA convolutional neural network (CNN), Big Transfer (BiT). Through a series of six systematically designed experiments, we then present analyses that provide both quantitative and qualitative indications of why ViTs are indeed more robust learners. For example, with fewer parameters and a similar dataset and pre-training combination, ViT achieves a top-1 accuracy of 28.10% on ImageNet-A, which is 4.3x higher than that of a comparable BiT variant. Our analyses of image masking, Fourier spectrum sensitivity, and spread of the discrete cosine energy spectrum reveal intriguing properties of ViT that contribute to its improved robustness. Code for reproducing our experiments is available at https://git.io/J3VO0.
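
To make the last of these analyses concrete, the sketch below shows one way to quantify how energy spreads across the 2D discrete cosine transform (DCT) spectrum of an image or feature map. This is a minimal illustration under stated assumptions (SciPy's `dctn` and an arbitrary 99% energy threshold), not necessarily the exact metric computed in our experiments.

```python
# Minimal sketch of a DCT energy-spread measure (illustrative assumptions:
# scipy.fft.dctn and a 99% energy threshold).
import numpy as np
from scipy.fft import dctn

def dct_energy_spread(x: np.ndarray, energy_fraction: float = 0.99) -> float:
    """Fraction of DCT coefficients needed to retain `energy_fraction` of the energy.

    Larger values indicate energy spread over many (including high-frequency)
    coefficients rather than concentrated in a few low-frequency ones.
    """
    coeffs = dctn(x, norm="ortho")                 # 2D DCT-II of the input
    energy = np.sort(coeffs.ravel() ** 2)[::-1]    # coefficient energies, descending
    cumulative = np.cumsum(energy) / energy.sum()  # cumulative energy share
    k = np.searchsorted(cumulative, energy_fraction) + 1
    return k / energy.size

# A smooth gradient concentrates energy in few coefficients, whereas white
# noise spreads it across the whole spectrum.
smooth = np.outer(np.linspace(0.0, 1.0, 64), np.linspace(0.0, 1.0, 64))
noise = np.random.default_rng(0).standard_normal((64, 64))
print(dct_energy_spread(smooth), dct_energy_spread(noise))
```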