Vision Transformers (ViTs) are becoming increasingly popular and are emerging as a dominant technique for various vision tasks, compared to Convolutional Neural Networks (CNNs). As a sought-after technique in computer vision, ViTs have successfully been applied to various vision problems by capturing long-range relationships. In this paper, we begin by introducing the fundamental concepts and background of the self-attention mechanism. Next, we provide a comprehensive overview of recent top-performing ViT methods, describing them in terms of strengths and weaknesses, computational cost, and training and testing datasets. We thoroughly compare the performance of various ViT algorithms and the most representative CNN methods on popular benchmark datasets. Finally, we discuss several limitations with insightful observations and suggest directions for further research. The project page, along with the collection of papers, is available at https://github.com/khawar512/ViT-Survey
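To make the core mechanism concrete, the following is a minimal sketch of the scaled dot-product self-attention that underlies ViTs, written in PyTorch. The function name, tensor shapes, and weight matrices here are illustrative assumptions, not an implementation from any specific surveyed method.

```python
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over a token sequence.

    x: (num_tokens, dim) patch embeddings; w_q/w_k/w_v: (dim, dim) projections.
    """
    # Project the input tokens into queries, keys, and values.
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    # Attention weights: softmax(Q K^T / sqrt(d)), so every token attends
    # to every other token -- the source of ViTs' long-range relationships.
    d = q.size(-1)
    attn = F.softmax(q @ k.transpose(-2, -1) / d ** 0.5, dim=-1)
    # Each output token is an attention-weighted mixture of all values.
    return attn @ v

# Example: 16 image-patch tokens, each with a 64-dimensional embedding.
x = torch.randn(16, 64)
w_q, w_k, w_v = (torch.randn(64, 64) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)
print(out.shape)  # torch.Size([16, 64])
```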