Vision Transformers (ViTs) have shown competitive accuracy in image classification tasks compared with CNNs. Yet they generally require much more data for model pre-training. Most recent works are thus dedicated to designing more complex architectures or training methods to address the data-efficiency issue of ViTs. However, few of them explore improving the self-attention mechanism, a key factor distinguishing ViTs from CNNs. Different from existing works, we introduce a conceptually simple scheme, called refiner, to directly refine the self-attention maps of ViTs. Specifically, refiner explores attention expansion, which projects the multi-head attention maps into a higher-dimensional space to promote their diversity. Further, refiner applies convolutions to augment the local patterns of the attention maps, which we show is equivalent to a distributed local attention: features are aggregated locally with learnable kernels and then globally aggregated with self-attention. Extensive experiments demonstrate that refiner works surprisingly well. Significantly, it enables ViTs to achieve 86% top-1 classification accuracy on ImageNet with only 81M parameters.
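To make the two operations concrete, below is a minimal PyTorch sketch of the refiner idea as described in the abstract: expand the multi-head attention maps to more heads, augment their local patterns with a convolution, then reduce back before they are applied to the values. The module name `AttentionRefiner`, the parameter names (`expanded_heads`, `kernel_size`), and the choice of 1x1 convolutions for the expansion/reduction and a depth-wise convolution for the local augmentation are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn


class AttentionRefiner(nn.Module):
    """Illustrative sketch (not the authors' code): refine softmax attention maps
    by (1) projecting them to a larger number of heads (attention expansion) and
    (2) applying a convolution over each expanded map to augment local patterns,
    before reducing back to the original number of heads."""

    def __init__(self, num_heads: int, expanded_heads: int, kernel_size: int = 3):
        super().__init__()
        # 1x1 convolutions act as linear projections across the head dimension,
        # treating the (tokens x tokens) attention map as a 2D "image".
        self.expand = nn.Conv2d(num_heads, expanded_heads, kernel_size=1)
        # Depth-wise convolution augments local patterns of each expanded map
        # with a learnable kernel (the "distributed local attention" view).
        self.local = nn.Conv2d(
            expanded_heads, expanded_heads, kernel_size,
            padding=kernel_size // 2, groups=expanded_heads,
        )
        self.reduce = nn.Conv2d(expanded_heads, num_heads, kernel_size=1)

    def forward(self, attn: torch.Tensor) -> torch.Tensor:
        # attn: (batch, num_heads, tokens, tokens), softmax-normalized.
        attn = self.expand(attn)   # project into a higher-dimensional head space
        attn = self.local(attn)    # convolutional augmentation of local patterns
        attn = self.reduce(attn)   # map back to the original number of heads
        return attn


# Usage sketch: refine the attention maps before multiplying them with the values.
refiner = AttentionRefiner(num_heads=6, expanded_heads=12)
attn = torch.softmax(torch.randn(2, 6, 197, 197), dim=-1)  # dummy ViT attention
refined = refiner(attn)                                    # (2, 6, 197, 197)
```

Under this reading, the expansion promotes diversity by giving the model more attention maps than heads in the original projection, and the depth-wise convolution aggregates attention weights locally with learnable kernels before the usual global aggregation over the values.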