Recently, a series of vision Transformers has emerged, showing superior performance with more compact model sizes than conventional convolutional neural networks, thanks to the strong ability of Transformers to model long-range dependencies. However, the advantages of vision Transformers come at a price: self-attention, the core component of the Transformer, has quadratic complexity with respect to the input sequence length. Computation and memory costs therefore grow dramatically as sequences lengthen, making it difficult to apply Transformers to vision tasks that require dense predictions on high-resolution feature maps. In this paper, we propose a new vision Transformer, named Glance-and-Gaze Transformer (GG-Transformer), to address these issues. It is motivated by the Glance-and-Gaze behavior of humans when recognizing objects in natural scenes, and it efficiently models both long-range dependencies and local context. In GG-Transformer, the Glance-and-Gaze behavior is realized by two parallel branches: the Glance branch performs self-attention on adaptively-dilated partitions of the input, which yields linear complexity while retaining a global receptive field; the Gaze branch is a simple depth-wise convolutional layer that complements the features obtained by the Glance mechanism with local image context. We empirically demonstrate that our method achieves consistently superior performance over previous state-of-the-art Transformers on various vision tasks and benchmarks. Code and models will be made available at https://github.com/yucornetto/GG-Transformer.
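To make the two-branch design concrete, the following is a minimal PyTorch sketch of one Glance-and-Gaze block. The class name `GlanceGazeBlock`, the `partition_size` parameter, and the additive composition of the two branches are illustrative assumptions, not the authors' released implementation (see the repository linked above for that). The key idea it demonstrates: dilated partitioning samples the token grid at a stride, so each partition spans the whole image while holding a fixed number of tokens, keeping attention cost linear in the number of tokens; the depth-wise convolution then restores the local neighborhoods that the strided sampling skips.

```python
import torch
import torch.nn as nn


class GlanceGazeBlock(nn.Module):
    """Minimal sketch of a Glance-and-Gaze block (illustrative only).

    Glance: self-attention over adaptively-dilated partitions of the token
    grid; every partition covers the full spatial extent at a stride.
    Gaze: a depth-wise convolution supplying local context.
    """

    def __init__(self, dim, num_heads=4, partition_size=4):
        super().__init__()
        self.p = partition_size  # tokens per side of each partition (assumed parameter)
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gaze = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)

    def forward(self, x):
        # x: (B, C, H, W); H and W are assumed divisible by partition_size.
        B, C, H, W = x.shape
        p = self.p
        dh, dw = H // p, W // p  # adaptive dilation rates: grow with resolution
        # Glance: gather tokens at stride (dh, dw) into dh*dw partitions of
        # p*p tokens each; attention cost is then linear in H*W for fixed p.
        g = x.view(B, C, p, dh, p, dw)
        g = g.permute(0, 3, 5, 2, 4, 1).reshape(B * dh * dw, p * p, C)
        g = self.norm(g)
        g, _ = self.attn(g, g, g)  # self-attention within each dilated partition
        # Scatter tokens back to their original spatial positions.
        g = g.view(B, dh, dw, p, p, C).permute(0, 5, 3, 1, 4, 2).reshape(B, C, H, W)
        # Gaze: depth-wise conv restores local neighborhoods; adding the two
        # branches here is a simplification of the paper's exact composition.
        return x + g + self.gaze(x)


block = GlanceGazeBlock(dim=64, num_heads=4, partition_size=4)
out = block(torch.randn(2, 64, 32, 32))  # -> torch.Size([2, 64, 32, 32])
```

Note the complexity argument the sketch embodies: because each of the dh*dw partitions holds a fixed p*p tokens regardless of resolution, attention costs O(H*W*p^2) rather than O((H*W)^2), while each partition still sees tokens from the entire image.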