Its dynamic attention mechanism and global modeling ability give the Transformer strong feature learning ability. In recent years, Transformer-based models have become comparable to CNN-based methods in computer vision. This review surveys current research progress on the Transformer in image and video applications and provides a comprehensive overview of its use in visual learning and understanding. First, the attention mechanism, which plays an essential part in the Transformer, is reviewed. Second, the visual Transformer model and the principle of each of its modules are introduced. Third, existing Transformer-based models are examined, and their performance is compared across visual learning and understanding applications. Three image tasks and two video tasks in computer vision are covered: the image tasks are image classification, object detection, and image segmentation, and the video tasks are object tracking and video classification. It is valuable to compare the performance of different models across these tasks on several public benchmark data sets. Finally, ten general problems are summarized, and the development prospects of the visual Transformer are discussed.