Recently, a series of works in computer vision have shown promising results on various image and video understanding tasks using self-attention. However, due to the quadratic computational and memory complexity of self-attention, these works either apply attention only to low-resolution feature maps in the later stages of a deep network or restrict the receptive field of attention in each layer to a small local region. To overcome these limitations, this work introduces a new global self-attention module, referred to as the GSA module, which is efficient enough to serve as the backbone component of a deep network. This module consists of two parallel layers: a content attention layer that attends to pixels based only on their content, and a positional attention layer that attends to pixels based on their spatial locations. The output of the module is the sum of the outputs of the two layers. Based on the proposed GSA module, we introduce new standalone global attention-based deep networks that use GSA modules instead of convolutions to model pixel interactions. Due to the global extent of the proposed GSA module, a GSA network can model long-range pixel interactions throughout the network. Our experimental results show that GSA networks significantly outperform the corresponding convolution-based networks on the CIFAR-100 and ImageNet datasets while using fewer parameters and less computation. The proposed GSA networks also outperform various existing attention-based networks on the ImageNet dataset.
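The two-branch structure described above can be illustrated with a minimal sketch. This is not the paper's exact formulation (the actual GSA module uses a more efficient attention computation); it only shows the stated design: a content attention branch that weights pixels by content similarity, a positional attention branch driven by precomputed relative-position scores (here passed in as a hypothetical `rel_pos_scores` matrix), and a final sum of the two outputs.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gsa_module(x, Wq, Wk, Wv, rel_pos_scores):
    """Illustrative sketch of a GSA-style module on a flattened feature map.

    x:              (n, d) array of n pixel features of dimension d.
    Wq, Wk, Wv:     (d, d) projection matrices (hypothetical parameters).
    rel_pos_scores: (n, n) attention logits based only on spatial
                    positions (hypothetical; stands in for the paper's
                    positional attention layer).
    """
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    # Content attention: every pixel attends to every other pixel
    # based on content similarity (global extent).
    content_out = softmax(q @ k.T / np.sqrt(k.shape[1])) @ v
    # Positional attention: attention weights depend only on
    # spatial locations, not on content.
    positional_out = softmax(rel_pos_scores) @ v
    # Module output is the sum of the two parallel branches.
    return content_out + positional_out
```

Because both branches attend over all n pixels, a stack of such modules can propagate information across the entire feature map at every stage, which is the long-range modeling ability the abstract attributes to GSA networks.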