With Vision Transformers (ViTs) making great advances in a variety of computer vision tasks, recent literature has proposed various variants of the vanilla ViT to achieve better efficiency and efficacy. However, it remains unclear how their unique architectural designs impact robustness to common corruptions. In this paper, we make the first attempt to probe the robustness gap among ViT variants and to explore the underlying designs that are essential for robustness. Through extensive and rigorous benchmarking, we demonstrate that simple architectural designs such as overlapping patch embedding and a convolutional feed-forward network (FFN) can promote the robustness of ViTs. Moreover, since training ViTs relies heavily on data augmentation, it is worth investigating whether previous CNN-based augmentation strategies targeted at robustness remain useful. We explore different data augmentations on ViTs and verify that adversarial noise training is powerful, whereas Fourier-domain augmentation is inferior. Based on these findings, we introduce a novel conditional method that generates dynamic augmentation parameters conditioned on the input image, achieving state-of-the-art robustness to common corruptions.