Transformers in Vision: A Survey

Astounding results from Transformer models on natural language tasks have prompted the vision community to study their application to computer vision problems. Among their salient benefits, Transformers can model long-range dependencies between input sequence elements and support parallel processing of sequences, in contrast to recurrent networks such as Long Short-Term Memory (LSTM). Unlike convolutional networks, Transformers require minimal inductive biases in their design and are naturally suited as set-functions. Furthermore, the straightforward design of Transformers allows multiple modalities (e.g., images, videos, text, and speech) to be processed with similar building blocks, and it scales well to very large capacity networks and huge datasets. These strengths have led to exciting progress on a number of vision tasks using Transformer networks. This survey aims to provide a comprehensive overview of Transformer models in the computer vision discipline. We start with an introduction to the fundamental concepts behind the success of Transformers, i.e., self-attention, large-scale pre-training, and bidirectional encoding. We then cover extensive applications of Transformers in vision, including popular recognition tasks (e.g., image classification, object detection, action recognition, and segmentation), generative modeling, multi-modal tasks (e.g., visual question answering, visual reasoning, and visual grounding), video processing (e.g., activity recognition, video forecasting), low-level vision (e.g., image super-resolution, image enhancement, and colorization), and 3D analysis (e.g., point cloud classification and segmentation). We compare the respective advantages and limitations of popular techniques in terms of both architectural design and experimental value. Finally, we provide an analysis of open research directions and possible future work.
