Recent literature has shown that design strategies from Convolutional Neural Networks (CNNs) benefit Vision Transformers (ViTs) across various vision tasks. However, it remains unclear how these design choices affect robustness when transferred to ViTs. In this paper, we make the first attempt to investigate how CNN-like architectural designs and CNN-based data augmentation strategies affect ViTs' robustness to common corruptions, through an extensive and rigorous benchmark. We demonstrate that overlapping patch embedding and a convolutional Feed-Forward Network (FFN) boost robustness. Furthermore, adversarial noise training is highly effective for ViTs, while Fourier-domain augmentation fails. Moreover, we introduce a novel conditional method that enables input-varied augmentations from two angles: (1) generating dynamic augmentation parameters conditioned on the input image, which leads to state-of-the-art robustness via conditional convolutions; (2) selecting the most suitable augmentation strategy with an extra predictor, which achieves the best trade-off between clean accuracy and robustness.
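To make the two angles of the conditional method concrete, the following is a minimal sketch, not the paper's implementation: a small convolutional hyper-network predicts a per-image augmentation parameter (here an additive-noise magnitude), and a second head scores a set of candidate augmentations so the highest-scoring one is applied per image. The module name `ConditionalAugmenter` and the two candidate augmentations are illustrative assumptions.

```python
import torch
import torch.nn as nn


class ConditionalAugmenter(nn.Module):
    """Illustrative input-conditioned augmentation module (not the paper's code)."""

    def __init__(self, num_strategies: int = 2):
        super().__init__()
        # Lightweight convolutional feature extractor shared by both heads.
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),
        )
        # Head 1: per-image augmentation parameter (noise std, squashed to [0, 0.5]).
        self.param_head = nn.Linear(16, 1)
        # Head 2: logits over candidate augmentation strategies.
        self.select_head = nn.Linear(16, num_strategies)

    def forward(self, x: torch.Tensor):
        feat = self.features(x).flatten(1)                       # (B, 16)
        noise_std = 0.5 * torch.sigmoid(self.param_head(feat))   # (B, 1)
        strategy_logits = self.select_head(feat)                 # (B, num_strategies)
        return noise_std, strategy_logits

    def augment(self, x: torch.Tensor) -> torch.Tensor:
        noise_std, strategy_logits = self.forward(x)
        choice = strategy_logits.argmax(dim=1)                   # per-image strategy index
        # Strategy 0: additive Gaussian noise with the predicted, input-dependent std.
        noisy = x + noise_std.view(-1, 1, 1, 1) * torch.randn_like(x)
        # Strategy 1: horizontal flip (a stand-in for any other augmentation).
        flipped = torch.flip(x, dims=[3])
        mask = (choice == 0).float().view(-1, 1, 1, 1)
        return mask * noisy + (1.0 - mask) * flipped


# Usage: augment a batch before feeding it to the ViT being trained.
if __name__ == "__main__":
    aug = ConditionalAugmenter()
    images = torch.rand(4, 3, 224, 224)
    augmented = aug.augment(images)
    print(augmented.shape)  # torch.Size([4, 3, 224, 224])
```

In this sketch the augmentation parameters and the strategy choice both vary with the input image, which is the essence of the conditional scheme described above; the actual architecture, candidate augmentations, and training objective in the paper may differ.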