Recent advances in Vision Transformers (ViTs) and their improved variants have shown that self-attention-based networks surpass traditional Convolutional Neural Networks (CNNs) in most vision tasks. However, existing ViTs focus on standard accuracy and computation cost, leaving the intrinsic influence of design choices on model robustness and generalization largely uninvestigated. In this work, we conduct a systematic evaluation of ViT components in terms of their impact on robustness to adversarial examples, common corruptions, and distribution shifts. We find that some components can be harmful to robustness. By using and combining robust components as the building blocks of ViTs, we propose the Robust Vision Transformer (RVT), a new vision transformer that achieves superior performance together with strong robustness. We further propose two new plug-and-play techniques, position-aware attention scaling and patch-wise augmentation, to augment RVT; we abbreviate the augmented model as RVT*. Experimental results on ImageNet and six robustness benchmarks demonstrate the superior robustness and generalization ability of RVT compared with previous ViTs and state-of-the-art CNNs. Furthermore, RVT-S* achieves the Top-1 rank on multiple robustness leaderboards, including ImageNet-C and ImageNet-Sketch. The code will be available at \url{https://github.com/alibaba/easyrobust}.
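To make the position-aware attention scaling (PAAS) idea concrete, below is a minimal PyTorch sketch, assuming PAAS replaces the fixed $1/\sqrt{d}$ attention scale with a learnable per-position scaling over the $N \times N$ attention map. The class and parameter names are hypothetical illustrations, not the authors' API; the official implementation is in the linked repository.

\begin{verbatim}
import torch
import torch.nn as nn

class PositionAwareAttention(nn.Module):
    """Self-attention with position-aware attention scaling (sketch).

    Assumption: the constant 1/sqrt(d) scale is replaced by a learnable
    per-head, per-position scaling matrix over the N x N attention map,
    initialized to the usual constant so training starts from standard
    scaled dot-product attention.
    """
    def __init__(self, dim, num_heads, num_tokens):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, dim * 3, bias=False)
        self.proj = nn.Linear(dim, dim)
        # Learnable position-aware scale (hypothetical parameterization).
        self.scale = nn.Parameter(
            torch.full((num_heads, num_tokens, num_tokens),
                       self.head_dim ** -0.5)
        )

    def forward(self, x):
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        # Each of q, k, v: (B, heads, N, head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)
        # Element-wise, position-dependent scaling of attention logits.
        attn = (q @ k.transpose(-2, -1)) * self.scale
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)
\end{verbatim}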