Transformer, first applied to the field of natural language processing, is a type of deep neural network mainly based on the self-attention mechanism. Thanks to its strong representation capabilities, researchers are looking at ways to apply transformer to computer vision tasks. On a variety of visual benchmarks, transformer-based models perform similarly to or better than other types of networks, such as convolutional and recurrent neural networks. Given its high performance and lesser need for vision-specific inductive bias, transformer is receiving more and more attention from the computer vision community. In this paper, we review these vision transformer models by categorizing them by task and analyzing their advantages and disadvantages. The main categories we explore include the backbone network, high/mid-level vision, low-level vision, and video processing. We also cover efficient transformer methods for pushing transformer into real device-based applications. Furthermore, we take a brief look at the self-attention mechanism in computer vision, as it is the base component of transformer. Toward the end of this paper, we discuss the challenges and provide several further research directions for vision transformers.
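To make the base component concrete, below is a minimal sketch of scaled dot-product self-attention, the operation underlying transformer blocks. The function name, projection matrices, and dimensions are illustrative, not from this paper; a real implementation would add multiple heads, masking, and learned parameters.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention over a token sequence x of shape (n, d)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v              # project tokens to queries/keys/values
    d_k = k.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)                  # pairwise token similarities, scaled
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over keys (rows sum to 1)
    return weights @ v                               # attention-weighted sum of values

# toy usage: a sequence of 4 tokens with model dimension 8
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
w_q, w_k, w_v = (rng.standard_normal((8, 8)) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)
print(out.shape)  # (4, 8)
```

Because every token attends to every other token, self-attention captures global dependencies in one step, which is the property vision transformers exploit over the local receptive fields of convolutions.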