Vision transformers have been applied successfully to image recognition tasks. They have been based either on multi-headed self-attention (ViT \cite{dosovitskiy2020image}, DeiT \cite{touvron2021training}), similar to the original work on textual models, or, more recently, on spectral layers (FNet \cite{lee2021fnet}, GFNet \cite{rao2021global}, AFNO \cite{guibas2021efficient}). We hypothesize that both spectral and multi-headed attention layers play a major role. We investigate this hypothesis in this work and observe that combining spectral and multi-headed attention layers indeed provides a better transformer architecture. We therefore propose the novel SpectFormer architecture for transformers, which combines spectral and multi-headed attention layers. We believe that the resulting representation allows the transformer to capture feature representations appropriately, and it yields improved performance over other transformer representations. For instance, it improves the top-1 accuracy on ImageNet by 2\% over both GFNet-H and LiT. SpectFormer-S reaches 84.25\% top-1 accuracy on ImageNet-1K (state of the art among small-sized transformers), and SpectFormer-L achieves 85.7\%, which is state of the art among comparable base-sized transformers. We further verify that SpectFormer obtains reasonable results in other scenarios, such as transfer learning on standard datasets including CIFAR-10, CIFAR-100, Oxford-IIIT-Flower, and Stanford Cars. We then investigate its use in downstream tasks such as object detection and instance segmentation on the MS-COCO dataset, and observe that SpectFormer shows consistent performance that is comparable to the best backbones and can be further optimized and improved. Hence, we believe that combined spectral and attention layers are what vision transformers need.
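To make the proposed combination concrete, the following is a minimal PyTorch-style sketch (not the authors' released code): the first $\alpha$ blocks mix tokens with a learnable global filter in the Fourier domain (in the spirit of GFNet), and the remaining blocks use standard multi-headed self-attention. All class names, hyperparameters, and the specific split between spectral and attention blocks are illustrative assumptions, not the paper's exact configuration.

\begin{verbatim}
# Minimal sketch of the spectral-then-attention idea; names and
# hyperparameters are illustrative assumptions, not the official code.
import torch
import torch.nn as nn


class SpectralBlock(nn.Module):
    """Token mixing in the frequency domain via a learnable global filter."""

    def __init__(self, dim, h, w):
        super().__init__()
        self.h, self.w = h, w
        self.norm = nn.LayerNorm(dim)
        # Complex-valued filter over the rFFT grid: (h, w//2 + 1, dim).
        self.filter = nn.Parameter(torch.randn(h, w // 2 + 1, dim, 2) * 0.02)

    def forward(self, x):  # x: (B, N, dim) with N == h * w
        B, N, C = x.shape
        res = x
        x = self.norm(x).reshape(B, self.h, self.w, C)
        x = torch.fft.rfft2(x, dim=(1, 2), norm="ortho")   # to frequency domain
        x = x * torch.view_as_complex(self.filter)         # elementwise gating
        x = torch.fft.irfft2(x, s=(self.h, self.w), dim=(1, 2), norm="ortho")
        return res + x.reshape(B, N, C)


class AttentionBlock(nn.Module):
    """Standard pre-norm multi-headed self-attention block."""

    def __init__(self, dim, heads=8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):
        y = self.norm(x)
        return x + self.attn(y, y, y, need_weights=False)[0]


class SpectFormerSketch(nn.Module):
    """alpha spectral blocks followed by (depth - alpha) attention blocks."""

    def __init__(self, dim=384, h=14, w=14, depth=12, alpha=4):
        super().__init__()
        self.blocks = nn.ModuleList(
            [SpectralBlock(dim, h, w) for _ in range(alpha)]
            + [AttentionBlock(dim) for _ in range(depth - alpha)]
        )

    def forward(self, x):  # x: (B, h*w, dim) patch embeddings
        for blk in self.blocks:
            x = blk(x)
        return x
\end{verbatim}

In this sketch, spectral blocks come first so that early layers capture global, frequency-domain structure cheaply, while the later attention blocks model content-dependent token interactions; the split point $\alpha$ would be a design choice to tune.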