With the popularity of Transformer architectures in computer vision, the research focus has shifted towards computationally efficient designs. Window-based local attention is one of the major techniques adopted in recent works. These methods begin with a very small patch size and embedding dimension and then apply strided convolutions (patch merging) to reduce the feature map size and increase the embedding dimension, thus forming a pyramidal, Convolutional Neural Network (CNN)-like design. In this work, we investigate local and global information modelling in transformers by presenting a novel isotropic architecture that adopts local windows and special tokens, called Super tokens, for self-attention. Specifically, a single Super token is assigned to each image window and captures the rich local details of that window. These tokens are then employed for cross-window communication and global representation learning. Hence, most of the learning in the higher layers is independent of the number of image patches $N$, and the class embedding is learned solely from the $N/M^2$ Super tokens, where $M^2$ is the window size. For standard image classification on ImageNet-1K, the proposed Super tokens based transformer (STT-S25) achieves 83.5\% accuracy, on par with the Swin transformer (Swin-B), with roughly half the number of parameters (49M) and twice the inference throughput. The proposed Super token transformer offers a lightweight and promising backbone for visual recognition tasks.
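The core mechanism described above, one Super token per local window, local attention inside each window, and global attention over the $N/M^2$ Super tokens only, can be illustrated with a minimal PyTorch sketch. This is not the authors' implementation; the module name, hyper-parameters, and pre-norm layout are assumptions made for illustration.

```python
# Minimal sketch (assumed, not the paper's code) of Super-token window attention:
# each M x M window of patch tokens attends locally together with one learnable
# Super token; the N / M^2 Super tokens then attend globally across windows.
import torch
import torch.nn as nn


class SuperTokenBlock(nn.Module):
    def __init__(self, dim=384, num_heads=6, window=7):
        super().__init__()
        self.window = window
        # one shared learnable initialisation for every window's Super token
        self.super_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.local_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.global_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x, H, W):
        # x: (B, N, C) patch tokens on an H x W grid, N = H * W
        B, N, C = x.shape
        M = self.window
        # partition into non-overlapping M x M windows -> (B * nW, M*M, C)
        x = x.view(B, H // M, M, W // M, M, C).permute(0, 1, 3, 2, 4, 5)
        x = x.reshape(-1, M * M, C)
        nW = x.shape[0] // B

        # prepend one Super token per window and attend locally within the window
        st = self.super_token.expand(x.shape[0], -1, -1)
        z = torch.cat([st, x], dim=1)
        zn = self.norm1(z)
        z = z + self.local_attn(zn, zn, zn, need_weights=False)[0]

        # global attention over the N / M^2 Super tokens only (cross-window communication)
        st = z[:, :1].reshape(B, nW, C)
        sn = self.norm2(st)
        st = st + self.global_attn(sn, sn, sn, need_weights=False)[0]
        return st  # (B, N / M^2, C) Super tokens, from which the class embedding is learned


# usage: a 14 x 14 patch grid with 7 x 7 windows yields 4 Super tokens per image
block = SuperTokenBlock(dim=384, num_heads=6, window=7)
tokens = torch.randn(2, 14 * 14, 384)
print(block(tokens, H=14, W=14).shape)  # torch.Size([2, 4, 384])
```

The key point of the sketch is the last stage: global attention operates on $N/M^2$ tokens rather than all $N$ patches, which is what decouples the cost of the higher layers from the number of image patches.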