Since the Transformer has found widespread use in NLP, its potential in CV has been recognized and has inspired many new approaches. However, replacing word tokens with image patches after tokenizing the image requires a vast amount of computation (e.g., in ViT), which bottlenecks model training and inference. In this paper, we propose a new attention mechanism in the Transformer, termed Cross Attention, which alternates attention within each image patch instead of over the whole image to capture local information, and attention between image patches divided from single-channel feature maps to capture global information. Both operations require less computation than the standard self-attention in the Transformer. By alternately applying attention within patches and between patches, we implement Cross Attention to maintain performance at a lower computational cost, and we build a hierarchical network, called Cross Attention Transformer (CAT), for other vision tasks. Our base model achieves state-of-the-art results on ImageNet-1K and improves the performance of other methods on COCO and ADE20K, illustrating that our network has the potential to serve as a general backbone. The code and models are available at \url{https://github.com/linhezheng19/CAT}.
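The sketch below is a minimal illustration of the alternating scheme described above: attention among pixels inside each patch (local), followed by attention between patches taken from single-channel feature maps (global). It is written in plain PyTorch under simplifying assumptions (a single head, no learned query/key/value projections, hypothetical function names); it is not the authors' released implementation, which is available at the repository linked above.

\begin{verbatim}
import torch
import torch.nn.functional as F


def attention(x):
    # Simplified single-head scaled dot-product self-attention
    # (no learned projections). x: (..., tokens, dim)
    scale = x.shape[-1] ** -0.5
    attn = F.softmax((x @ x.transpose(-2, -1)) * scale, dim=-1)
    return attn @ x


def inner_patch_attention(x, patch=7):
    # Attention among pixels inside each non-overlapping patch
    # (local information). x: (B, H, W, C); cost scales with
    # patch**2 per patch rather than (H*W)**2 over the whole image.
    B, H, W, C = x.shape
    x = x.reshape(B, H // patch, patch, W // patch, patch, C)
    x = x.permute(0, 1, 3, 2, 4, 5).reshape(-1, patch * patch, C)
    x = attention(x)
    x = x.reshape(B, H // patch, W // patch, patch, patch, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, C)


def cross_patch_attention(x, patch=7):
    # Attention between patches divided from each single-channel
    # feature map (global information): each patch of one channel
    # becomes a token of dimension patch*patch.
    B, H, W, C = x.shape
    x = x.permute(0, 3, 1, 2)  # (B, C, H, W)
    x = x.reshape(B * C, H // patch, patch, W // patch, patch)
    x = x.permute(0, 1, 3, 2, 4).reshape(B * C, -1, patch * patch)
    x = attention(x)
    x = x.reshape(B * C, H // patch, W // patch, patch, patch)
    x = x.permute(0, 1, 3, 2, 4).reshape(B, C, H, W)
    return x.permute(0, 2, 3, 1)  # back to (B, H, W, C)


if __name__ == "__main__":
    feat = torch.randn(2, 56, 56, 96)  # (B, H, W, C) feature map
    # Alternate the two attention operations.
    out = cross_patch_attention(inner_patch_attention(feat))
    print(out.shape)  # torch.Size([2, 56, 56, 96])
\end{verbatim}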