Vision transformers (ViTs) process input images as sequences of patches via self-attention, an architecture radically different from convolutional neural networks (CNNs). This makes it interesting to study the adversarial feature space of ViT models and their transferability. In particular, we observe that adversarial patterns found via conventional adversarial attacks show very low black-box transferability even for large ViT models. However, we show that this phenomenon is only due to sub-optimal attack procedures that do not leverage the true representation potential of ViTs. A deep ViT is composed of multiple blocks with a consistent architecture comprising self-attention and feed-forward layers, where each block is capable of independently producing a class token. Formulating an attack using only the last class token (the conventional approach) does not directly leverage the discriminative information stored in the earlier tokens, leading to poor adversarial transferability of ViTs. Exploiting the compositional nature of ViT models, we enhance the transferability of existing attacks by introducing two novel strategies specific to the ViT architecture. (i) Self-Ensemble: We propose a method to find multiple discriminative pathways by dissecting a single ViT model into an ensemble of networks. This allows explicitly utilizing class-specific information at each ViT block. (ii) Token Refinement: We then propose to refine the tokens to further enhance the discriminative capacity at each block of the ViT. Our token refinement systematically combines the class tokens with structural information preserved within the patch tokens. An adversarial attack, when applied to such refined tokens within the ensemble of classifiers found in a single vision transformer, has significantly higher transferability.
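To make the self-ensemble idea concrete, the following is a minimal PyTorch sketch of crafting an adversarial example from the class token of every block rather than only the last one. It assumes a ViT-like model exposing `patch_embed`, `cls_token`, `pos_embed`, `blocks`, `norm`, and `head` (these attribute names are illustrative and may differ in a given implementation), and it uses a single-step sign-gradient attack for brevity; the token-refinement step described above is omitted here.

```python
# Sketch of the self-ensemble attack idea, assuming a ViT-like model with
# the attribute names used below (hypothetical; adapt to your implementation).
import torch
import torch.nn.functional as F

def self_ensemble_logits(model, x):
    """Return one logit vector per transformer block by reading the class
    token after every block and passing it through the shared head."""
    tokens = model.patch_embed(x)                                # (B, N, D) patch tokens
    cls = model.cls_token.expand(x.shape[0], -1, -1)             # (B, 1, D) class token
    tokens = torch.cat([cls, tokens], dim=1) + model.pos_embed   # prepend class token
    logits_per_block = []
    for block in model.blocks:
        tokens = block(tokens)
        cls_k = model.norm(tokens[:, 0])                         # class token after block k
        logits_per_block.append(model.head(cls_k))
    return logits_per_block

def self_ensemble_fgsm(model, x, y, eps=8 / 255):
    """Single-step attack maximizing the loss summed over all blocks,
    instead of using only the final class token."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = sum(F.cross_entropy(logits, y)
               for logits in self_ensemble_logits(model, x_adv))
    loss.backward()
    return (x_adv + eps * x_adv.grad.sign()).clamp(0, 1).detach()
```

In this sketch the ensemble is "free": every intermediate class token is routed through the same final head, so each block acts as a separate classifier whose gradient contributes to the perturbation.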