Following their success in natural language processing and understanding, transformers are expected to bring revolutionary changes to computer vision. This work provides a comprehensive study of the robustness of vision transformers (ViTs) against adversarial perturbations. Tested under various white-box and transfer attack settings, we find that ViTs possess better adversarial robustness than MLP-Mixer and convolutional neural networks (CNNs), including ConvNeXt, and this observation also holds for certified robustness. Through frequency analysis and feature visualization, we summarize the following main observations that contribute to the improved robustness of ViTs: 1) Features learned by ViTs contain fewer spuriously correlated high-frequency patterns, which helps explain why ViTs are less sensitive to high-frequency perturbations than CNNs and MLP-Mixer; moreover, the extent to which a model learns high-frequency features correlates strongly with its robustness against different frequency-based perturbations. 2) Introducing convolutional or tokens-to-token blocks to learn high-frequency features in ViTs can improve classification accuracy, but at the cost of adversarial robustness. 3) Modern CNN designs that borrow techniques from ViTs, such as the activation function, layer normalization, larger kernel sizes to imitate global attention, and patchified image inputs, can help bridge the gap between ViTs and CNNs not only in clean accuracy but also in certified and empirical adversarial robustness. Moreover, we show that adversarial training is also applicable to ViTs for training robust models, and that sharpness-aware minimization can further improve robustness, while pre-training with clean images on larger datasets does not significantly improve adversarial robustness.
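The white-box attack setting referenced above can be illustrated with a minimal sketch. This is not the paper's ViT evaluation pipeline; it is a hypothetical toy example applying an FGSM-style perturbation (sign of the input gradient) to a simple logistic-regression "model", the simplest instance of a gradient-based white-box attack:

```python
import numpy as np

# Hypothetical toy setup: a logistic-regression model attacked with the
# sign of the input gradient (FGSM-style), illustrating the white-box
# setting where the attacker has full access to model gradients.

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(x, w, y):
    # Binary cross-entropy for a single example.
    p = sigmoid(w @ x)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def fgsm(x, w, y, eps):
    # White-box step: gradient of the loss w.r.t. the *input* x.
    p = sigmoid(w @ x)
    grad_x = (p - y) * w          # d(BCE)/dx for logistic regression
    return x + eps * np.sign(grad_x)

rng = np.random.default_rng(0)
w = rng.normal(size=8)
x = rng.normal(size=8)
y = 1.0

x_adv = fgsm(x, w, y, eps=0.1)
print(loss(x, w, y), loss(x_adv, w, y))  # the perturbation increases the loss
```

The same input-gradient principle underlies the stronger iterative attacks (e.g. PGD) typically used to evaluate ViTs and CNNs, and adversarial training amounts to generating such perturbed examples during training and minimizing the loss on them.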