Recent advances in Vision Transformers (ViTs) and their improved variants have shown that self-attention-based networks surpass traditional Convolutional Neural Networks (CNNs) in most vision tasks. However, existing ViTs focus on standard accuracy and computation cost, leaving the intrinsic influence of architectural components on model robustness and generalization largely unexplored. In this work, we conduct a systematic evaluation of the components of ViTs in terms of their impact on robustness to adversarial examples, common corruptions, and distribution shifts. We find that some components can be harmful to robustness. By using and combining robust components as the building blocks of ViTs, we propose the Robust Vision Transformer (RVT), a new vision transformer that achieves superior performance with strong robustness. We further propose two new plug-and-play techniques, position-aware attention scaling and patch-wise augmentation, to augment RVT; we abbreviate the augmented model as RVT*. Experimental results on ImageNet and six robustness benchmarks demonstrate the advanced robustness and generalization ability of RVT compared with previous ViTs and state-of-the-art CNNs. Furthermore, RVT-S* achieves a Top-1 rank on multiple robustness leaderboards, including ImageNet-C and ImageNet-Sketch. The code will be available at \url{https://git.io/Jswdk}.
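To make the first plug-and-play technique concrete, below is a minimal PyTorch sketch of position-aware attention scaling, under the assumption that it replaces the fixed $1/\sqrt{d}$ attention scale with a learnable per-head, per-position-pair scaling matrix initialized to that constant. The module name and constructor arguments are illustrative, not the released implementation.

```python
import torch
import torch.nn as nn


class PositionAwareAttention(nn.Module):
    """Self-attention with a learnable position-aware scaling matrix.

    Hypothetical sketch: the fixed 1/sqrt(d) scale of standard
    scaled dot-product attention is replaced by a learnable scale
    for each head and each (query position, key position) pair, so
    attention strength can be modulated by token position.
    """

    def __init__(self, dim: int, num_heads: int, num_tokens: int):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, dim * 3, bias=False)
        self.proj = nn.Linear(dim, dim)
        # Learnable scaling, initialized to the standard 1/sqrt(d)
        # constant so training starts from vanilla attention.
        self.pos_scale = nn.Parameter(
            torch.full((num_heads, num_tokens, num_tokens),
                       self.head_dim ** -0.5))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)  # each: (B, H, N, d)
        # Element-wise, position-dependent scaling of the attention logits.
        attn = (q @ k.transpose(-2, -1)) * self.pos_scale
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)
```

For a 224x224 input split into 16x16 patches plus a class token, `num_tokens` would be 197; the learnable scale then has one entry per head for each of the 197x197 position pairs.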