Vision Transformers are Robust Learners

Transformers, composed of multiple self-attention layers, hold strong promise as a generic learning primitive applicable to different data modalities, as shown by recent breakthroughs in computer vision that achieve state-of-the-art (SOTA) standard accuracy with better parameter efficiency. Since self-attention helps a model systematically align the different components present in the input data, it is natural to investigate how such models perform on robustness benchmarks. In this work, we study the robustness of the Vision Transformer (ViT) against common corruptions and perturbations, distribution shifts, and natural adversarial examples. We use six diverse ImageNet datasets concerning robust classification to conduct a comprehensive performance comparison of ViT models and a SOTA convolutional neural network (CNN), Big-Transfer (BiT). Through a series of six systematically designed experiments, we then present analyses that provide both quantitative and qualitative indications to explain why ViTs are indeed more robust learners. For example, with fewer parameters and similar dataset and pre-training combinations, ViT gives a top-1 accuracy of 28.10% on ImageNet-A, which is 4.3x higher than a comparable variant of BiT. Our analyses of image masking, Fourier spectrum sensitivity, and spread on the discrete cosine energy spectrum reveal intriguing properties of ViT that contribute to its improved robustness. Code for reproducing our experiments is available at https://git.io/J3VO0.
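To make the image-masking analysis and the top-1 metric mentioned above more concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the authors' procedure: the abstract does not describe how masking is done, so the 16-pixel patch size, the zero fill value, the uniform random patch selection, and the helper names `mask_random_patches` and `top1_accuracy` are all hypothetical.

```python
import numpy as np


def mask_random_patches(image, patch_size=16, mask_fraction=0.5, seed=0):
    """Return a copy of `image` with a random subset of patches set to zero.

    Assumptions (not taken from the paper): non-overlapping square patches,
    zero fill, and uniform sampling of patches without replacement.
    image: array of shape (H, W, C) with H and W multiples of patch_size.
    """
    h, w = image.shape[:2]
    if h % patch_size or w % patch_size:
        raise ValueError("image height and width must be multiples of patch_size")

    rows, cols = h // patch_size, w // patch_size
    num_patches = rows * cols
    num_masked = int(round(mask_fraction * num_patches))

    rng = np.random.default_rng(seed)
    masked_ids = rng.choice(num_patches, size=num_masked, replace=False)

    out = image.copy()
    for idx in masked_ids:
        # Convert the flat patch index into grid coordinates.
        r, c = divmod(int(idx), cols)
        out[r * patch_size:(r + 1) * patch_size,
            c * patch_size:(c + 1) * patch_size] = 0
    return out


def top1_accuracy(logits, labels):
    """Fraction of rows whose highest-scoring class equals the label."""
    return float(np.mean(np.argmax(logits, axis=1) == labels))
```

In a robustness study of this kind, one would typically feed the masked images to each model and compare `top1_accuracy` as `mask_fraction` grows; the model call itself is left out here because it depends on the framework and checkpoints used.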
