The core component of the vanilla Vision Transformer (ViT) is the attention block, which gives the model its ability to capture the global context of the input image. For good performance, however, ViT requires large-scale training data. To overcome this data-hunger limitation, many ViT-based networks, or hybrid ViTs, have been proposed that incorporate local context during training. Unlike for CNNs, the robustness of ViTs and their variants against adversarial attacks has not been widely investigated in the literature. This work studies the robustness of ViT variants 1) against different Lp-based adversarial attacks in comparison with CNNs, 2) under adversarial examples (AEs) after applying preprocessing defense methods, and 3) under adaptive attacks using the expectation over transformation (EOT) framework. To that end, we run a set of experiments on 1000 images from ImageNet-1k and provide an analysis revealing that vanilla ViTs and hybrid ViTs are more robust than CNNs. For instance, we found that 1) vanilla ViTs and hybrid ViTs are more robust than CNNs under Lp-based attacks and under adaptive attacks, and 2) unlike hybrid ViTs, vanilla ViTs do not respond to preprocessing defenses that mainly reduce high-frequency components. Furthermore, feature maps, attention maps, and Grad-CAM visualizations, together with image-quality measures and the perturbations' energy spectrum, are provided for a deeper understanding of attention-based models.
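To make the notion of an Lp-bounded adversarial perturbation concrete, the following is a minimal sketch of an FGSM-style L-infinity attack. The toy inputs and the precomputed "gradient" are hypothetical stand-ins; in the actual evaluation the gradients come from trained ViT/CNN models via backpropagation.

```python
import numpy as np

def fgsm_linf(x, grad, eps):
    """Perturb x by one signed-gradient step, keeping the perturbation
    inside an L-infinity ball of radius eps (FGSM-style sketch)."""
    x_adv = x + eps * np.sign(grad)
    return np.clip(x_adv, 0.0, 1.0)  # keep pixel values in a valid range

# Toy example: three "pixels" and a hypothetical loss gradient.
x = np.array([0.2, 0.5, 0.9])
grad = np.array([0.3, -0.7, 0.1])
x_adv = fgsm_linf(x, grad, eps=0.05)

# The perturbation never exceeds the eps budget in L-infinity norm.
print(np.max(np.abs(x_adv - x)))  # → 0.05
```

Iterating this step with projection back onto the eps-ball yields PGD, the stronger attack family typically used in robustness benchmarks like the one described above.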