Vision Transformers (ViTs) have become a dominant paradigm for visual representation learning, built on self-attention operators. Although these operators give the model flexibility through their adjustable attention kernels, they suffer from inherent limitations: (1) the attention kernel is not discriminative enough, resulting in high redundancy across ViT layers, and (2) computational and memory complexity is quadratic in the sequence length. In this paper, we propose a novel attention operator, called lightweight structure-aware attention (LiSA), which offers greater representation power at log-linear complexity. Our operator learns structural patterns by using a set of relative position embeddings (RPEs). To achieve log-linear complexity, the RPEs are approximated with fast Fourier transforms. Our experiments and ablation studies demonstrate that ViTs based on the proposed operator outperform those based on self-attention and other existing operators, achieving state-of-the-art results on ImageNet and competitive results on other visual understanding benchmarks such as COCO and Something-Something-V2. The source code of our approach will be released online.
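To make the complexity claim concrete, the following is a minimal sketch (not the authors' implementation) of how a relative-position kernel can be applied to a token sequence in O(n log n) via the FFT instead of O(n^2) with an explicit kernel matrix. The function name `fft_relative_mixing` and the circulant (circular-convolution) simplification are our own assumptions for illustration; LiSA's actual operator may differ in how the RPEs are parameterized and approximated.

```python
import numpy as np

def fft_relative_mixing(x, rpe):
    """Mix tokens with a relative-position kernel via circular convolution.

    x   : (n, d) token features.
    rpe : (n,)   relative-position weights, one per offset.

    The circulant kernel K[i, j] = rpe[(i - j) mod n] applied as y = K @ x
    costs O(n^2 d) directly; computing the same circular convolution in the
    frequency domain costs O(n log n) per feature channel.
    """
    rpe_f = np.fft.fft(rpe)                        # (n,)  kernel spectrum
    x_f = np.fft.fft(x, axis=0)                    # (n, d) per-channel spectra
    # Pointwise product in frequency domain == circular convolution in time.
    return np.fft.ifft(rpe_f[:, None] * x_f, axis=0).real

# Sanity check against the explicit circulant matrix on a toy example.
rng = np.random.default_rng(0)
n, d = 8, 4
x = rng.normal(size=(n, d))
rpe = rng.normal(size=n)
K = np.array([[rpe[(i - j) % n] for j in range(n)] for i in range(n)])
assert np.allclose(K @ x, fft_relative_mixing(x, rpe))
```

This only illustrates the standard FFT convolution identity that underlies the log-linear cost; the learned, structure-aware parts of the operator are described in the paper itself.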