Transformers have achieved widespread success in computer vision. At their heart is the Self-Attention (SA) mechanism, an inductive bias that associates each token in the input with every other token through a weighted basis. The standard SA mechanism has quadratic complexity in the sequence length, which impedes its use on the long sequences that arise in high-resolution vision. Recently, inspired by operator learning for PDEs, Adaptive Fourier Neural Operators (AFNO) were introduced for high-resolution attention based on global convolution, implemented efficiently via the FFT. However, AFNO's global filtering cannot represent well the small- and moderate-scale structures that commonly appear in natural images. To capture these coarse-to-fine scale structures, we introduce Multiscale Wavelet Attention (MWA), built on wavelet neural operators, which incurs linear complexity in the sequence size. We replace the attention in ViT with MWA, and our experiments on CIFAR and ImageNet classification demonstrate significant improvements over alternative Fourier-based attention mechanisms such as AFNO and the Global Filter Network (GFN).
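To make the idea of wavelet-based token mixing concrete, below is a minimal sketch (not the authors' MWA implementation): tokens on an H x W grid are decomposed with a one-level Haar wavelet transform, each subband is filtered by its own learned per-channel weights (a simplified stand-in for the richer per-subband operators MWA would use), and the grid is reconstructed with the inverse transform. The cost is linear in the number of tokens. All names here (e.g. HaarTokenMixer) are illustrative assumptions, not identifiers from the paper.

```python
import torch
import torch.nn as nn


def haar_dwt2(x):
    """One-level 2-D Haar transform of x with shape (B, C, H, W); H and W must be even."""
    a = x[:, :, 0::2, 0::2]
    b = x[:, :, 0::2, 1::2]
    c = x[:, :, 1::2, 0::2]
    d = x[:, :, 1::2, 1::2]
    ll = (a + b + c + d) / 2  # coarse approximation
    lh = (a + b - c - d) / 2  # detail subbands
    hl = (a - b + c - d) / 2
    hh = (a - b - c + d) / 2
    return ll, lh, hl, hh


def haar_idwt2(ll, lh, hl, hh):
    """Inverse of haar_dwt2 (the Haar transform is orthonormal, so the same butterfly inverts it)."""
    a = (ll + lh + hl + hh) / 2
    b = (ll + lh - hl - hh) / 2
    c = (ll - lh + hl - hh) / 2
    d = (ll - lh - hl + hh) / 2
    B, C, H, W = ll.shape
    x = ll.new_zeros(B, C, 2 * H, 2 * W)
    x[:, :, 0::2, 0::2] = a
    x[:, :, 0::2, 1::2] = b
    x[:, :, 1::2, 0::2] = c
    x[:, :, 1::2, 1::2] = d
    return x


class HaarTokenMixer(nn.Module):
    """Drop-in token mixer: filter each wavelet subband instead of computing pairwise attention."""

    def __init__(self, dim):
        super().__init__()
        # One learned per-channel weight vector for each of the 4 subbands (LL, LH, HL, HH).
        self.weights = nn.Parameter(torch.ones(4, dim))

    def forward(self, x, grid_size):
        # x: (B, N, C) tokens with N = H * W, where (H, W) = grid_size.
        B, N, C = x.shape
        H, W = grid_size
        x = x.transpose(1, 2).reshape(B, C, H, W)
        bands = haar_dwt2(x)
        mixed = [band * w.view(1, C, 1, 1) for band, w in zip(bands, self.weights)]
        x = haar_idwt2(*mixed)
        return x.reshape(B, C, N).transpose(1, 2)


if __name__ == "__main__":
    mixer = HaarTokenMixer(dim=192)
    tokens = torch.randn(2, 14 * 14, 192)          # e.g. a 14 x 14 patch grid from a ViT
    out = mixer(tokens, grid_size=(14, 14))
    print(out.shape)                                # torch.Size([2, 196, 192])
```

A multilevel decomposition (recursing on the LL band) and learned per-subband MLPs, rather than the diagonal scaling used here, would bring this sketch closer to a full multiscale wavelet operator, while keeping the linear cost in sequence length.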