Vision transformers have been applied successfully to image recognition tasks. They have been based either on multi-headed self-attention (ViT \cite{dosovitskiy2020image}, DeiT \cite{touvron2021training}), similar to the original work on textual models, or, more recently, on spectral layers (FNet \cite{lee2021fnet}, GFNet \cite{rao2021global}, AFNO \cite{guibas2021efficient}). We hypothesize that both spectral and multi-headed attention layers play a major role. We investigate this hypothesis in this work and observe that combining spectral and multi-headed attention layers indeed yields a better transformer architecture. We therefore propose the novel SpectFormer architecture for transformers, which combines spectral and multi-headed attention layers. We believe that the resulting representation allows the transformer to capture features appropriately, and it yields improved performance over other transformer representations. For instance, it improves the top-1 accuracy on ImageNet by 2\% over both GFNet-H and LiT. SpectFormer-S reaches 84.25\% top-1 accuracy on ImageNet-1K, the state of the art among small-sized models, and SpectFormer-L achieves 85.7\%, the state of the art among comparable base-sized transformers. We further verify that SpectFormer transfers well to standard datasets such as CIFAR-10, CIFAR-100, Oxford-IIIT-flower, and Stanford Cars. We then investigate its use in downstream tasks such as object detection and instance segmentation on the MS-COCO dataset and observe that SpectFormer performs consistently, comparably to the best backbones, and can be further optimized and improved. Hence, we believe that combined spectral and attention layers are what vision transformers need.
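To make the proposed combination concrete, the following is a minimal PyTorch sketch of the staging described above: GFNet-style spectral gating layers in the initial blocks, followed by standard multi-headed self-attention blocks. The module names, the 1-D FFT over the token axis (rather than 2-D spatial frequencies), and the layer counts are illustrative assumptions, not the exact SpectFormer implementation.

\begin{verbatim}
import torch
import torch.nn as nn

class SpectralGatingBlock(nn.Module):
    """GFNet-style spectral layer (illustrative 1-D variant): mix tokens
    with a learnable filter in the frequency domain."""
    def __init__(self, num_tokens: int, dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        # Learnable complex filter, one weight per rfft frequency bin.
        self.filter = nn.Parameter(
            torch.randn(num_tokens // 2 + 1, dim, 2) * 0.02)

    def forward(self, x):  # x: (batch, num_tokens, dim)
        y = self.norm(x)
        y = torch.fft.rfft(y, dim=1, norm="ortho")     # to frequency domain
        y = y * torch.view_as_complex(self.filter)     # element-wise gating
        y = torch.fft.irfft(y, n=x.shape[1], dim=1, norm="ortho")  # back
        return x + y                                   # residual connection

class AttentionBlock(nn.Module):
    """Standard pre-norm multi-headed self-attention block with an MLP."""
    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):
        y = self.norm1(x)
        x = x + self.attn(y, y, y, need_weights=False)[0]
        return x + self.mlp(self.norm2(x))

class SpectFormerSketch(nn.Module):
    """Illustrative stack: spectral layers early, attention layers later."""
    def __init__(self, num_tokens=196, dim=192,
                 num_spectral=4, num_attention=8):
        super().__init__()
        blocks = [SpectralGatingBlock(num_tokens, dim)
                  for _ in range(num_spectral)]
        blocks += [AttentionBlock(dim) for _ in range(num_attention)]
        self.blocks = nn.Sequential(*blocks)

    def forward(self, x):  # x: (batch, num_tokens, dim) patch embeddings
        return self.blocks(x)

tokens = torch.randn(2, 196, 192)  # e.g. 14x14 patches of a 224x224 image
print(SpectFormerSketch()(tokens).shape)  # torch.Size([2, 196, 192])
\end{verbatim}

Placing the spectral layers first reflects the intuition that frequency-domain filtering can capture fine-grained image structure early, leaving global semantic mixing to the later attention blocks.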