Recent studies show that Vision Transformers (ViTs) exhibit strong robustness against various corruptions. Although this property is partly attributed to the self-attention mechanism, a systematic understanding is still lacking. In this paper, we examine the role of self-attention in learning robust representations. Our study is motivated by the intriguing properties of the emerging visual grouping in Vision Transformers, which indicate that self-attention may promote robustness through improved mid-level representations. We further propose a family of fully attentional networks (FANs) that strengthen this capability by incorporating an attentional channel processing design. We validate the design comprehensively on various hierarchical backbones. With 76.8M parameters, our model achieves a state-of-the-art 87.1% accuracy on ImageNet-1k and 35.8% mCE on ImageNet-C. We also demonstrate state-of-the-art accuracy and robustness in two downstream tasks: semantic segmentation and object detection. Code is available at: https://github.com/NVlabs/FAN.