Vision Transformer (ViT), as a powerful alternative to the Convolutional Neural Network (CNN), has received much attention. Recent work showed that ViTs, like CNNs, are also vulnerable to adversarial examples. To build robust ViTs, an intuitive approach is to apply adversarial training, since it has been shown to be one of the most effective ways to obtain robust CNNs. However, one major limitation of adversarial training is its heavy computational cost. The self-attention mechanism adopted by ViTs is a computationally intensive operation whose cost grows quadratically with the number of input patches, making adversarial training on ViTs even more time-consuming. In this work, we first comprehensively study fast adversarial training on a variety of vision transformers and illustrate the relationship between efficiency and robustness. Then, to expedite adversarial training on ViTs, we propose an efficient Attention Guided Adversarial Training mechanism. Specifically, relying on the structure of self-attention, we actively remove certain patch embeddings at each layer with an attention-guided dropping strategy during adversarial training. The slimmed self-attention modules accelerate adversarial training on ViTs significantly. With only 65\% of the fast adversarial training time, we match the state-of-the-art results on the challenging ImageNet benchmark.
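The core idea of attention-guided dropping can be sketched with a toy top-k criterion: keep the patch tokens that receive the most attention from the class token and discard the rest, shrinking the quadratic cost of subsequent self-attention. This is a minimal illustrative sketch, not the paper's actual implementation; the function name, the use of CLS attention as the score, and the `keep_ratio` parameter are assumptions for illustration.

```python
import numpy as np

def attention_guided_drop(tokens, attn, keep_ratio=0.5):
    """Toy sketch of attention-guided patch dropping (illustrative only).

    tokens: (N, D) array of N patch embeddings.
    attn:   (N,) attention weights the class token assigns to each patch
            (assumed already softmax-normalized).
    Keeps the top-k attended patches and returns them in original order.
    """
    n = tokens.shape[0]
    k = max(1, int(round(keep_ratio * n)))
    keep = np.argsort(attn)[::-1][:k]   # indices of the k most-attended patches
    keep = np.sort(keep)                # preserve the original patch ordering
    return tokens[keep], keep

# Toy example: 8 patch tokens of dimension 4 with random CLS attention.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(8, 4))
attn = rng.random(8)
attn /= attn.sum()                      # normalize like a softmax row
slim, kept = attention_guided_drop(tokens, attn, keep_ratio=0.5)
print(slim.shape)                       # half the tokens survive
```

Since self-attention cost scales quadratically in the token count, halving the tokens at a layer cuts that layer's attention cost roughly fourfold, which is where the training speedup comes from.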