The past year has witnessed the rapid development of applying the Transformer module to vision problems. While some researchers have demonstrated that Transformer-based models enjoy a favorable ability to fit data, there is a growing body of evidence showing that these models suffer from over-fitting, especially when the training data is limited. This paper offers an empirical study that performs step-by-step operations to gradually transition a Transformer-based model into a convolution-based model. The results we obtain during the transition process deliver useful messages for improving visual recognition. Based on these observations, we propose a new architecture named Visformer, abbreviated from `Vision-friendly Transformer'. With the same computational complexity, Visformer outperforms both the Transformer-based and convolution-based models in terms of ImageNet classification accuracy, and the advantage becomes more significant when the model complexity is lower or the training set is smaller. The code is available at https://github.com/danczs/Visformer.