Recent advances in Vision Transformers (ViTs) have demonstrated their impressive performance in image classification, making them a promising alternative to Convolutional Neural Networks (CNNs). Unlike CNNs, a ViT represents an input image as a sequence of image patches. This patch-based input representation raises an interesting question: How does ViT perform, compared to CNNs, when individual input image patches are perturbed with natural corruptions or adversarial perturbations? In this work, we study the robustness of ViT to patch-wise perturbations. Surprisingly, we find that ViTs are more robust to naturally corrupted patches than CNNs, whereas they are more vulnerable to adversarial patches. Furthermore, we discover that the attention mechanism greatly affects the robustness of vision transformers. Specifically, the attention module can help improve the robustness of ViT by effectively ignoring naturally corrupted patches. However, when ViTs are attacked by an adversary, the attention mechanism can easily be fooled into focusing on the adversarially perturbed patches, causing misclassification. Based on our analysis, we propose a simple temperature-scaling-based method to improve the robustness of ViT against adversarial patches. Extensive qualitative and quantitative experiments are performed to support our findings, understanding, and improvement of ViT robustness to patch-wise perturbations across a set of transformer-based architectures.
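The abstract does not spell out where the temperature scaling is applied. As a rough, non-authoritative illustration, assuming the method raises the softmax temperature inside self-attention so that attention is flattened over patch tokens (preventing a single adversarial patch from dominating), a minimal sketch might look like the following; the function name `attention_with_temperature` and the parameter `tau` are hypothetical, not from the paper:

```python
import torch
import torch.nn.functional as F

def attention_with_temperature(q, k, v, tau=1.0):
    """Scaled dot-product attention with an extra temperature tau.

    tau > 1 smooths the attention distribution, so no single
    (possibly adversarially perturbed) patch token can dominate;
    tau = 1 recovers standard attention.
    """
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / (d ** 0.5)  # (B, N, N) similarity logits
    attn = F.softmax(scores / tau, dim=-1)         # temperature-scaled softmax
    return attn @ v

# Toy usage: batch of 1, 197 tokens (196 patches + CLS), head dim 64.
q = k = v = torch.randn(1, 197, 64)
out = attention_with_temperature(q, k, v, tau=2.0)
```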