Refiner: Refining Self-attention for Vision Transformers

Vision Transformers (ViTs) have shown competitive accuracy in image classification tasks compared with CNNs. Yet they generally require much more data for model pre-training. Most recent works are thus dedicated to designing more complex architectures or training methods to address the data-efficiency issue of ViTs. However, few of them explore improving the self-attention mechanism, a key factor distinguishing ViTs from CNNs. Different from existing works, we introduce a conceptually simple scheme, called refiner, to directly refine the self-attention maps of ViTs. Specifically, refiner explores attention expansion, which projects the multi-head attention maps to a higher-dimensional space to promote their diversity. Further, refiner applies convolutions to augment local patterns of the attention maps, which we show is equivalent to a distributed local attention: features are aggregated locally with learnable kernels and then globally aggregated with self-attention. Extensive experiments demonstrate that refiner works surprisingly well. Significantly, it enables ViTs to achieve 86% top-1 classification accuracy on ImageNet with only 81M parameters.
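The two operations named in the abstract, attention expansion and convolution over the attention maps, can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The function names, the tensor layout (heads, tokens, tokens), the odd kernel size with zero padding, the final linear reduction back to the original number of heads, and the absence of any re-normalization after refinement are all assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def refine_attention(attn, w_expand, kernels, w_reduce):
    """Refine multi-head attention maps (illustrative sketch only).

    attn:     (H, N, N)   attention maps, one per head
    w_expand: (H2, H)     hypothetical expansion weights, H2 >= H
    kernels:  (H2, k, k)  hypothetical per-map kernels, k assumed odd
    w_reduce: (H, H2)     hypothetical reduction weights
    """
    # Attention expansion: mix the H maps linearly into H2 maps.
    expanded = np.einsum("gh,hnm->gnm", w_expand, attn)

    # Head-wise convolution: each expanded map is filtered with its own
    # k x k kernel (cross-correlation with zero padding, output stays N x N).
    _, k, _ = kernels.shape
    pad = k // 2
    n = expanded.shape[-1]
    padded = np.pad(expanded, ((0, 0), (pad, pad), (pad, pad)))
    conv = np.zeros_like(expanded)
    for i in range(k):
        for j in range(k):
            conv += kernels[:, i, j, None, None] * padded[:, i:i + n, j:j + n]

    # Assumed reduction step: project back to H maps so they can weight V.
    return np.einsum("hg,gnm->hnm", w_reduce, conv)


def refined_self_attention(q, k, v, w_expand, kernels, w_reduce):
    """q, k, v: (H, N, d). Returns per-head outputs of shape (H, N, d)."""
    d = q.shape[-1]
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d))
    refined = refine_attention(attn, w_expand, kernels, w_reduce)
    return refined @ v


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    H, H2, N, d = 2, 6, 16, 8
    q, k, v = (rng.normal(size=(H, N, d)) for _ in range(3))
    out = refined_self_attention(
        q, k, v,
        rng.normal(size=(H2, H)),
        rng.normal(size=(H2, 3, 3)),
        rng.normal(size=(H, H2)),
    )
    print(out.shape)
```

With identity expansion and reduction weights and a kernel that is 1 at its center and 0 elsewhere, `refine_attention` returns its input unchanged, which is a convenient sanity check. In a trained model all three weight sets would be learned.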


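Related content

Attention mechanism

The attention mechanism was first proposed in the field of computer vision, but it really took off with the Google DeepMind paper "Recurrent Models of Visual Attention" [14], which used attention in an RNN model for image classification. Bahdanau et al. then used an attention-like mechanism in "Neural Machine Translation by Jointly Learning to Align and Translate" [1] to perform translation and alignment jointly on a machine translation task; their work is generally regarded as the first to apply attention to NLP. Attention-based extensions of RNN models were subsequently applied to a wide range of NLP tasks. More recently, how to use attention in CNNs has also become an active research topic. The figure below shows the general trend of attention research.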
