BViT: Broad Attention based Vision Transformer

Recent work has shown that transformers can achieve promising performance in computer vision by exploiting the relationships among image patches with self-attention. However, these models consider attention only within a single feature layer and ignore the complementarity of attention across different levels. In this paper, we propose broad attention, which improves performance by incorporating the attention relationships of different layers of a vision transformer; we call the resulting model BViT. Broad attention is implemented by a broad connection and parameter-free attention. The broad connection of each transformer layer promotes the transmission and integration of information in BViT. Without introducing additional trainable parameters, parameter-free attention jointly attends to the attention information already available in different layers to extract useful information and build relationships among them. Experiments on image classification show that BViT delivers state-of-the-art top-1 accuracy of 74.8%/81.6% on ImageNet with 5M/22M parameters. Moreover, we transfer BViT to downstream object recognition benchmarks and achieve 98.9% and 89.9% on CIFAR10 and CIFAR100 respectively, exceeding ViT with fewer parameters. As a generalization test, adding broad attention to Swin Transformer and T2T-ViT also brings an improvement of more than 1%. In summary, broad attention is a promising way to improve the performance of attention-based models. Code and pre-trained models are available at https://github.com/DRL-CASIA/Broad_ViT.
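
The abstract names two ingredients, a broad connection that gathers attention information from every layer and a parameter-free attention applied to the gathered result. The sketch below is one possible reading of that idea, not the authors' implementation: it assumes each layer exposes its query, key and value matrices, assumes they are joined along the feature dimension, and then applies plain scaled dot-product attention with no new weights. All function names (`concat_features`, `broad_attention`, and so on) are hypothetical; the official code is at the repository linked above.

```python
import math

def transpose(m):
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]

def softmax(row):
    peak = max(row)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v):
    """Scaled dot-product attention; it has no trainable parameters."""
    scale = math.sqrt(len(q[0]))
    scores = matmul(q, transpose(k))
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, v)

def concat_features(mats):
    """Join per-layer (tokens x dim) matrices along the feature axis.
    ASSUMPTION: the abstract does not say along which axis layers are joined."""
    return [[x for m in mats for x in m[i]] for i in range(len(mats[0]))]

def broad_attention(layer_qkv):
    """layer_qkv: list of (Q, K, V) tuples, one per transformer layer."""
    q = concat_features([q for q, _, _ in layer_qkv])
    k = concat_features([k for _, k, _ in layer_qkv])
    v = concat_features([v for _, _, v in layer_qkv])
    return attention(q, k, v)

# Toy input: 2 layers, 3 tokens, 2 features per layer.
layers = [
    ([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],   # Q of layer 1
     [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],   # K of layer 1
     [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]),  # V of layer 1
    ([[0.5, 0.5], [1.0, 0.0], [0.0, 1.0]],   # Q of layer 2
     [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]],   # K of layer 2
     [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]),  # V of layer 2
]
out = broad_attention(layers)  # 3 tokens x 4 features
```

Because the attention weights in each row sum to one, every output row is a weighted average of the concatenated value rows, which is why the step needs no parameters of its own. How the result is merged back into the network's output is not described in the abstract and is left out here.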

Related content

Attention mechanism

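The attention mechanism was first proposed in the field of visual images, but it really took off with the Google DeepMind paper "Recurrent Models of Visual Attention" [14], which applied attention to an RNN model for image classification. Bahdanau et al. then used an attention-like mechanism in "Neural Machine Translation by Jointly Learning to Align and Translate" [1] to perform translation and alignment jointly in machine translation; their work is generally regarded as the first to apply attention to NLP. Attention-based extensions of RNN models were subsequently applied to a wide range of NLP tasks. More recently, how to use attention in CNNs has also become an active research topic. The figure below shows the rough trend of attention research.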

[CVPR 2022] RGB-D action recognition via spatiotemporal decoupling and recoupling: Decoupling and Recoupling Spatiotemporal Representation for RGB-D-based Motion Recognition

Zhuanzhi member service

14+ reads · March 19, 2022

[CVPR 2022] Open-vocabulary one-stage detection via hierarchical visual-language knowledge distillation, Improving Visual Grounding with Visual-Linguistic Verification and Iterative Reasoning

Zhuanzhi member service

7+ reads · March 19, 2022

[ICCV 2021] Transformer-based neural painting

Zhuanzhi member service

23+ reads · September 20, 2021

[CVPR 2021] Rethinking semantic segmentation from a sequence-to-sequence perspective with Transformers

Zhuanzhi member service

44+ reads · March 15, 2021