Attention-based networks have achieved state-of-the-art performance in many computer vision tasks, such as image classification. Unlike Convolutional Neural Networks (CNNs), the major component of the vanilla Vision Transformer (ViT) is the attention block, which gives it the power of modeling the global context of the input image. This power is data hungry: the larger the training data, the better the performance. To overcome this limitation, many ViT-based networks, or hybrid-ViTs, have been proposed to include local context during training. The robustness of ViTs and their variants against adversarial attacks has not been widely investigated in the literature. Some robustness attributes were revealed in a few previous works, while deeper robustness attributes remain unexplored. This work studies the robustness of ViT variants 1) against different $L_p$-based adversarial attacks in comparison with CNNs, and 2) under Adversarial Examples (AEs) after applying preprocessing defense methods. To that end, we run a set of experiments on 1000 images from ImageNet-1k and provide an analysis revealing that vanilla ViTs and hybrid-ViTs are more robust than CNNs. For instance, we found that 1) vanilla ViTs and hybrid-ViTs are more robust than CNNs under $L_0$-, $L_1$-, $L_2$-, and $L_\infty$-based attacks as well as Color Channel Perturbations (CCP) attacks; 2) vanilla ViTs do not respond to preprocessing defenses that mainly reduce high-frequency components, while hybrid-ViTs are more responsive to such defenses; and 3) CCP can be used as a preprocessing defense, and larger ViT variants are found to be more responsive than other models. Furthermore, feature maps, attention maps, and Grad-CAM visualizations, together with image quality measures and the perturbations' energy spectrum, are provided for an in-depth understanding of attention-based models.