Recent research on the robustness of deep learning has shown that Vision Transformers (ViTs) surpass Convolutional Neural Networks (CNNs) under certain perturbations, e.g., natural corruptions and adversarial attacks. Some papers argue that the superior robustness of ViTs comes from the segmentation of their input images into patches; others claim that Multi-head Self-Attention (MSA) is the key to preserving robustness. In this paper, we introduce a principled and unified theoretical framework to investigate these arguments about ViT robustness. We first prove theoretically that, unlike Transformers in Natural Language Processing, ViTs are Lipschitz continuous. We then analyze the adversarial robustness of ViTs from the perspective of the Cauchy problem, through which we can quantify how robustness propagates through the layers. We demonstrate that the first and last layers are the critical factors affecting the robustness of ViTs. Furthermore, based on our theory, we show empirically that, contrary to the claims of existing research, MSA contributes to the adversarial robustness of ViTs only under weak adversarial attacks, e.g., FGSM, and, surprisingly, MSA actually compromises the model's adversarial robustness under stronger attacks, e.g., PGD attacks.
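For readers unfamiliar with the two attack strengths contrasted above, the following is a minimal sketch of FGSM (single-step) and PGD (multi-step, projected) in NumPy. It uses a toy logistic-regression model with an analytic input gradient as a stand-in for a ViT; the model, weights, and step sizes are illustrative assumptions, not the paper's setup.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(w, x, y):
    # Binary cross-entropy of a logistic model (toy stand-in for a ViT).
    p = sigmoid(w @ x)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def input_grad(w, x, y):
    # Analytic gradient of the loss w.r.t. the input x.
    return (sigmoid(w @ x) - y) * w

def fgsm(w, x, y, eps):
    # FGSM: a single step of size eps along the sign of the input gradient.
    return x + eps * np.sign(input_grad(w, x, y))

def pgd(w, x, y, eps, alpha=0.02, steps=10):
    # PGD: repeated signed-gradient steps, each projected back
    # into the l-infinity ball of radius eps around the clean input.
    x_adv = x.copy()
    for _ in range(steps):
        x_adv = x_adv + alpha * np.sign(input_grad(w, x_adv, y))
        x_adv = np.clip(x_adv, x - eps, x + eps)
    return x_adv

# Illustrative weights and input (assumptions, not from the paper).
w = np.array([1.0, -2.0, 0.5])
x = np.array([0.2, 0.1, -0.3])
y = 1.0
x_fgsm = fgsm(w, x, y, eps=0.1)
x_pgd = pgd(w, x, y, eps=0.1)
```

Both attacks stay within the same perturbation budget eps; PGD's iterative projection is what makes it the stronger attack on nonlinear models such as ViTs.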