Vision Transformers have recently shown great promise for many vision tasks, owing to their insightful architecture design and attention mechanism. By revisiting the self-attention responses in Transformers, we empirically observe two interesting issues. First, Vision Transformers present a query-irrelevant behavior at deep layers: the attention maps exhibit nearly identical global contexts regardless of the query patch position (and are also head-irrelevant). Second, the attention maps are intrinsically sparse, with a few tokens dominating the attention weights; introducing knowledge from ConvNets largely smooths the attention and enhances performance. Motivated by the above observations, we generalize the self-attention formulation to abstract a query-irrelevant global context directly and further integrate this global context into convolutions. The resulting model, a Fully Convolutional Vision Transformer (FCViT), consists purely of convolutional layers and firmly inherits the merits of both the attention mechanism and convolutions, including the dynamic property, weight sharing, and short- and long-range feature modeling. Experimental results demonstrate the effectiveness of FCViT. With fewer than 14M parameters, our FCViT-S12 outperforms the related work ResT-Lite by 3.7% top-1 accuracy on ImageNet-1K. When scaling FCViT up to larger models, it still outperforms the previous state-of-the-art ConvNeXt with even fewer parameters. FCViT-based models also demonstrate promising transferability to downstream tasks, such as object detection, instance segmentation, and semantic segmentation. Code and models are available at: https://github.com/ma-xu/FCViT.
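To make the core idea concrete, below is a minimal PyTorch sketch of a query-irrelevant global context fused with convolutions. This is an illustrative assumption, not the authors' actual FCViT block: a single context vector is computed from the mean token, broadcast to every spatial position, and combined with a depthwise-convolution branch for local modeling. All module and variable names (e.g., GlobalContextConvBlock, dwconv, fuse) are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GlobalContextConvBlock(nn.Module):
    """Hypothetical sketch: query-irrelevant global context + convolutions.

    Instead of per-query attention maps, a single global context vector is
    computed from all tokens (shared by every query position), broadcast back
    to the feature map, and mixed with a depthwise convolution branch.
    """

    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)                               # token projection (assumed)
        self.dwconv = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)   # local (short-range) branch
        self.fuse = nn.Conv2d(dim, dim, 1)                            # pointwise fusion

    def forward(self, x):
        # x: (B, C, H, W) feature map
        B, C, H, W = x.shape
        tokens = x.flatten(2).transpose(1, 2)                         # (B, N, C), N = H*W

        # Query-irrelevant global context: each token is weighted by its
        # similarity to the pooled (mean) token, independent of any query.
        pooled = tokens.mean(dim=1, keepdim=True)                     # (B, 1, C)
        sim = F.softmax(tokens @ pooled.transpose(1, 2) / C ** 0.5, dim=1)  # (B, N, 1)
        context = (sim * self.proj(tokens)).sum(dim=1, keepdim=True)        # (B, 1, C)

        # Broadcast the global context to all positions and fuse it with the
        # local depthwise-convolution branch.
        global_branch = context.transpose(1, 2).reshape(B, C, 1, 1).expand(B, C, H, W)
        local_branch = self.dwconv(x)
        return self.fuse(local_branch + global_branch)


if __name__ == "__main__":
    block = GlobalContextConvBlock(dim=64)
    out = block(torch.randn(2, 64, 14, 14))
    print(out.shape)  # torch.Size([2, 64, 14, 14])
```

Note that this sketch still uses a linear projection for clarity; the paper's actual model is described as purely convolutional, so the real block presumably realizes the same idea with convolutional operators only.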