Recent works have demonstrated that transformers can achieve promising performance in computer vision by exploiting the relationships among image patches with self-attention. However, they only consider the attention in a single feature layer and ignore the complementarity of attention in different layers. In this paper, we propose broad attention to improve performance by incorporating the attention relationships of different layers in the vision transformer; the resulting model is called BViT. Broad attention is implemented via broad connection and parameter-free attention. The broad connection of each transformer layer promotes the transmission and integration of information in BViT. Without introducing additional trainable parameters, parameter-free attention jointly focuses on the attention information already available in different layers to extract useful information and build the relationships among layers. Experiments on image classification tasks demonstrate that BViT delivers state-of-the-art top-1 accuracy of 74.8\%/81.6\% on ImageNet with 5M/22M parameters. Moreover, we transfer BViT to downstream object recognition benchmarks, achieving 98.9\% and 89.9\% on CIFAR10 and CIFAR100 respectively, exceeding ViT with fewer parameters. In generalization tests, broad attention also brings improvements of more than 1\% to Swin Transformer and T2T-ViT. In summary, broad attention is promising for improving the performance of attention-based models. Code and pre-trained models are available at https://github.com/DRL-CASIA/Broad_ViT.
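The abstract does not specify implementation details, so the following is only a minimal PyTorch sketch of the broad-attention idea it describes: per-layer attention maps and values are collected and fused without introducing any new trainable parameters. The class names, the averaging rule used to combine layers, and the residual fusion are illustrative assumptions, not the authors' exact design; see the repository linked above for the official code.

```python
# Minimal sketch of the broad-attention idea, NOT the authors' exact
# implementation. The aggregation rule (averaging per-layer attention
# maps) and the residual fusion below are assumptions for illustration.
import torch
import torch.nn as nn


class AttentionWithMap(nn.Module):
    """Standard multi-head self-attention that also returns its attention map."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor):
        b, n, d = x.shape
        qkv = self.qkv(x).reshape(b, n, 3, self.num_heads, d // self.num_heads)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)          # each: (b, heads, n, d_head)
        attn = (q @ k.transpose(-2, -1) * self.scale).softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, n, d)
        return self.proj(out), attn, v


class BroadAttention(nn.Module):
    """Parameter-free joint attention over the attention maps of all layers.

    It only reuses attention maps and values that the layers have already
    computed, so it adds no trainable parameters.
    """

    def forward(self, attns, values):
        attn = torch.stack(attns).mean(dim=0)          # (b, heads, n, n)
        v = torch.stack(values).mean(dim=0)            # (b, heads, n, d_head)
        out = attn @ v                                 # (b, heads, n, d_head)
        b, h, n, dh = out.shape
        return out.transpose(1, 2).reshape(b, n, h * dh)


if __name__ == "__main__":
    dim, depth = 64, 4
    layers = nn.ModuleList(AttentionWithMap(dim) for _ in range(depth))
    broad = BroadAttention()
    x = torch.randn(2, 16, dim)                        # (batch, patches, dim)
    attns, values = [], []
    for layer in layers:
        y, attn, v = layer(x)
        x = x + y                                      # residual connection
        attns.append(attn)                             # broad connection: keep
        values.append(v)                               # every layer's attention
    x = x + broad(attns, values)                       # fuse multi-layer attention
    print(x.shape)                                     # torch.Size([2, 16, 64])
```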