The transformer is a type of deep neural network based mainly on the self-attention mechanism and was originally applied in the field of natural language processing. Inspired by the strong representation ability of the transformer, researchers have proposed extending transformers to computer vision tasks. Transformer-based models achieve competitive and even better performance on various visual benchmarks compared with other network types such as convolutional networks and recurrent networks. Given their high performance and freedom from human-defined inductive biases, transformers are receiving more and more attention from the vision community. In this paper we provide a literature review of these visual transformer models, categorizing them by task and analyzing the advantages and disadvantages of each method. The main categories include basic image classification, high-level vision, low-level vision, and video processing. Self-attention in computer vision is also briefly revisited, since self-attention is the base component of the transformer. Efficient transformer methods are covered as well, as they push transformers toward real applications on resource-constrained devices. Finally, we discuss the challenges and further research directions for visual transformers.
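Since self-attention is the base component of the transformer discussed throughout this survey, a minimal sketch may help fix ideas. The following NumPy snippet implements single-head scaled dot-product self-attention; the projection matrices `Wq`, `Wk`, `Wv` and the dimensions are illustrative assumptions, not taken from any specific model in the survey.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax along the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a token sequence X of shape (n, d_model)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv          # project tokens to queries, keys, values
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)           # (n, n) pairwise attention logits
    weights = softmax(scores, axis=-1)        # each row is a distribution over tokens
    return weights @ V                        # weighted sum of value vectors

# Toy example: 4 tokens with model dimension 8
rng = np.random.default_rng(0)
n, d_model, d_k = 4, 8, 8
X = rng.standard_normal((n, d_model))
Wq, Wk, Wv = (rng.standard_normal((d_model, d_k)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # (4, 8): one output vector per input token
```

In vision transformers, `X` would typically hold embeddings of image patches rather than word tokens, and multiple such heads are run in parallel and concatenated.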