The Transformer, an attention-based encoder-decoder model, has already revolutionized the field of natural language processing (NLP). Inspired by these significant achievements, some pioneering works have recently applied Transformer-like architectures to the computer vision (CV) field, demonstrating their effectiveness on three fundamental CV tasks (classification, detection, and segmentation) as well as on multiple sensory data streams (images, point clouds, and vision-language data). Owing to their competitive modeling capabilities, visual Transformers have achieved impressive performance improvements over modern Convolutional Neural Networks (CNNs) on multiple benchmarks. In this survey, we comprehensively review over one hundred different visual Transformers according to the three fundamental CV tasks and different data stream types, and propose a taxonomy that organizes the representative methods according to their motivations, structures, and application scenarios. Because of their differences in training settings and dedicated vision tasks, we also evaluate and compare all these existing visual Transformers under different configurations. Furthermore, we reveal a series of essential but unexploited aspects that may empower visual Transformers to stand out from numerous architectures, e.g., slack high-level semantic embeddings to bridge the gap between the visual Transformers and the sequential ones. Finally, three promising research directions are suggested for future investigation. We will continue to update the latest articles and their released source code at https://github.com/liuyang-ict/awesome-visual-transformers.
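The attention mechanism underlying the Transformer architectures surveyed here can be sketched as scaled dot-product attention, softmax(QK^T / sqrt(d_k))V. The following is a minimal NumPy illustration of that operation only; it is not drawn from any specific model in the survey, and the shapes and variable names are illustrative assumptions.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V for 2-D token matrices."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # (n_q, n_k) similarity logits
    scores -= scores.max(axis=-1, keepdims=True)      # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ V                                # (n_q, d_v) weighted values

# Toy example: 3 query tokens attending over 4 key/value tokens of width 8.
rng = np.random.default_rng(0)
Q = rng.standard_normal((3, 8))
K = rng.standard_normal((4, 8))
V = rng.standard_normal((4, 8))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (3, 8)
```

In a vision Transformer, the query/key/value matrices would be linear projections of flattened image-patch embeddings rather than random tokens, but the attention computation itself is the same.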