In this paper, we present a new architecture, named Convolutional vision Transformer (CvT), that improves Vision Transformer (ViT) in performance and efficiency by introducing convolutions into ViT to yield the best of both designs. This is accomplished through two primary modifications: a hierarchy of Transformers containing a new convolutional token embedding, and a convolutional Transformer block leveraging a convolutional projection. These changes introduce desirable properties of convolutional neural networks (CNNs) to the ViT architecture (\ie shift, scale, and distortion invariance) while maintaining the merits of Transformers (\ie dynamic attention, global context, and better generalization). We validate CvT by conducting extensive experiments, showing that this approach achieves state-of-the-art performance over other Vision Transformers and ResNets on ImageNet-1k, with fewer parameters and lower FLOPs. In addition, performance gains are maintained when pretrained on larger datasets (\eg ImageNet-22k) and fine-tuned to downstream tasks. Pre-trained on ImageNet-22k, our CvT-W24 obtains a top-1 accuracy of 87.7\% on the ImageNet-1k val set. Finally, our results show that the positional encoding, a crucial component in existing Vision Transformers, can be safely removed in our model, simplifying the design for higher resolution vision tasks. Code will be released at \url{https://github.com/leoxiaobin/CvT}.
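To make the two modifications above concrete, below is a minimal PyTorch sketch of a convolutional token embedding and a convolutional projection. The class names, kernel sizes, and strides are illustrative assumptions for exposition, not the released implementation; see the repository above for the authors' code.

\begin{verbatim}
import torch
import torch.nn as nn

class ConvTokenEmbedding(nn.Module):
    """Overlapping convolution that maps an image (or token map) to a new
    token sequence, downsampling spatially at each stage (stride > 1)."""
    def __init__(self, in_ch, embed_dim, kernel_size=7, stride=4):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, embed_dim, kernel_size, stride,
                              padding=kernel_size // 2)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, x):                    # x: (B, C, H, W)
        x = self.proj(x)                     # (B, D, H', W')
        B, D, H, W = x.shape
        x = x.flatten(2).transpose(1, 2)     # (B, H'*W', D) token sequence
        return self.norm(x), (H, W)

class ConvProjection(nn.Module):
    """Depthwise separable convolution (depthwise conv -> BN -> pointwise
    conv) used in place of a linear q/k/v projection in a Transformer
    block; tokens are reshaped to a 2D map so the conv can exploit
    local spatial context."""
    def __init__(self, dim, kernel_size=3, stride=1):
        super().__init__()
        self.dw = nn.Conv2d(dim, dim, kernel_size, stride,
                            padding=kernel_size // 2, groups=dim, bias=False)
        self.bn = nn.BatchNorm2d(dim)
        self.pw = nn.Conv2d(dim, dim, kernel_size=1)

    def forward(self, tokens, hw):           # tokens: (B, N, D), N = H*W
        B, N, D = tokens.shape
        H, W = hw
        x = tokens.transpose(1, 2).reshape(B, D, H, W)
        x = self.pw(self.bn(self.dw(x)))
        return x.flatten(2).transpose(1, 2)  # back to (B, N', D)

emb = ConvTokenEmbedding(in_ch=3, embed_dim=64)
tokens, hw = emb(torch.randn(2, 3, 224, 224))  # (2, 3136, 64), hw = (56, 56)
q = ConvProjection(dim=64)(tokens, hw)         # stride > 1 would subsample k/v
\end{verbatim}

Because the convolutions already encode spatial locality, a sketch like this needs no explicit positional encoding, consistent with the claim above that it can be removed.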