The transformer is a type of deep neural network based mainly on the self-attention mechanism, originally applied in the field of natural language processing. Inspired by the strong representational power of the transformer, researchers have proposed extending it to computer vision tasks. Transformer-based models show competitive or even better performance on various visual benchmarks compared with other network types such as convolutional and recurrent networks. In this paper we provide a literature review of these visual transformer models, categorizing them by task and analyzing the advantages and disadvantages of these methods. The main categories include basic image classification, high-level vision, low-level vision, and video processing. Self-attention in computer vision is also briefly revisited, since self-attention is the base component of the transformer. Efficient transformer methods are included for pushing the transformer into real applications. Finally, we discuss further research directions for the visual transformer.
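Since self-attention is the base component of the transformer, a minimal sketch may help make the idea concrete. The following is an illustrative scaled dot-product self-attention in NumPy; the function name, shapes, and projection matrices are assumptions for the example, not code from any surveyed model.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention (illustrative sketch).

    x:   (seq_len, d_model) token representations
    w_*: (d_model, d_k) learned projection matrices
    """
    # Project each token into query, key, and value vectors.
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    # Pairwise similarity between tokens, scaled by sqrt(d_k).
    scores = q @ k.T / np.sqrt(k.shape[-1])
    # Softmax over the key dimension gives attention weights per token.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each output token is a weighted sum of all value vectors.
    return weights @ v

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                        # 4 tokens, model dim 8
w_q, w_k, w_v = (rng.normal(size=(8, 8)) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)
print(out.shape)                                   # (4, 8)
```

Every output token mixes information from all input tokens, which is the global receptive field that motivates applying transformers to vision tasks (e.g., treating image patches as tokens).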