Vision Transformers (ViTs), a specialized type of transformer, are used in various computer vision (CV) applications, such as image recognition. ViTs can address several potential problems of convolutional neural networks (CNNs). Different ViT variants are used for image coding tasks such as compression, super-resolution, segmentation, and denoising. The purpose of this survey is to present the applications of ViTs in CV. To the best of our knowledge, it is the first survey of its kind on ViTs for CV. First, we classify the CV applications to which ViTs are applicable: image classification, object detection, image segmentation, image compression, image super-resolution, image denoising, and anomaly detection. Next, we review the state of the art in each category and list the available models. We then present a detailed analysis and comparison of the models, listing the pros and cons of each. We also present our insights and lessons learned for each category. Finally, we discuss several open research challenges and future research directions.