Since the Transformer has found widespread use in NLP, its potential in CV has been recognized and has inspired many new approaches. However, replacing word tokens with image patches after tokenizing the image requires a vast amount of computation (e.g., in ViT), which bottlenecks model training and inference. In this paper, we propose a new attention mechanism in the Transformer, termed Cross Attention, which alternates attention within each image patch instead of over the whole image to capture local information, and attention between image patches divided from single-channel feature maps to capture global information. Both operations require less computation than the standard self-attention in the Transformer. By alternately applying attention within patches and between patches, we implement Cross Attention to maintain performance at a lower computational cost, and we build a hierarchical network, called Cross Attention Transformer (CAT), for other vision tasks. Our base model achieves state-of-the-art results on ImageNet-1K and improves the performance of other methods on COCO and ADE20K, illustrating that our network has the potential to serve as a general backbone. The code and models are available at \url{https://github.com/linhezheng19/CAT}.
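The sketch below is a minimal illustration of the alternating scheme described above: attention among pixels inside each patch (local), followed by attention between patches taken from single-channel feature maps (global). It is written in plain PyTorch under simplifying assumptions (a single head, no learned query/key/value projections, hypothetical function names); it is not the authors' released implementation, which is available at the repository linked above.

\begin{verbatim}
import torch
import torch.nn.functional as F


def attention(x):
    # Simplified single-head scaled dot-product self-attention
    # (no learned projections). x: (..., tokens, dim)
    scale = x.shape[-1] ** -0.5
    attn = F.softmax((x @ x.transpose(-2, -1)) * scale, dim=-1)
    return attn @ x


def inner_patch_attention(x, patch=7):
    # Attention among pixels inside each non-overlapping patch
    # (local information). x: (B, H, W, C); cost scales with
    # patch**2 per patch rather than (H*W)**2 over the whole image.
    B, H, W, C = x.shape
    x = x.reshape(B, H // patch, patch, W // patch, patch, C)
    x = x.permute(0, 1, 3, 2, 4, 5).reshape(-1, patch * patch, C)
    x = attention(x)
    x = x.reshape(B, H // patch, W // patch, patch, patch, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, C)


def cross_patch_attention(x, patch=7):
    # Attention between patches divided from each single-channel
    # feature map (global information): each patch of one channel
    # becomes a token of dimension patch*patch.
    B, H, W, C = x.shape
    x = x.permute(0, 3, 1, 2)  # (B, C, H, W)
    x = x.reshape(B * C, H // patch, patch, W // patch, patch)
    x = x.permute(0, 1, 3, 2, 4).reshape(B * C, -1, patch * patch)
    x = attention(x)
    x = x.reshape(B * C, H // patch, W // patch, patch, patch)
    x = x.permute(0, 1, 3, 2, 4).reshape(B, C, H, W)
    return x.permute(0, 2, 3, 1)  # back to (B, H, W, C)


if __name__ == "__main__":
    feat = torch.randn(2, 56, 56, 96)  # (B, H, W, C) feature map
    # Alternate the two attention operations.
    out = cross_patch_attention(inner_patch_attention(feat))
    print(out.shape)  # torch.Size([2, 56, 56, 96])
\end{verbatim}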