While the Vision Transformer (VT) architecture is becoming increasingly popular in computer vision, pure VT models perform poorly on tiny datasets. To address this issue, this paper proposes locality guidance for improving the performance of VTs on tiny datasets. We first show that local information, which is of great importance for understanding images, is hard to learn with limited data due to the high flexibility and intrinsic globality of the self-attention mechanism in VTs. To facilitate the learning of local information, we realize locality guidance for VTs by imitating the features of an already trained convolutional neural network (CNN), inspired by the built-in local-to-global hierarchy of CNNs. Under our dual-task learning paradigm, the locality guidance provided by a lightweight CNN trained on low-resolution images is adequate to accelerate convergence and improve the performance of VTs to a large extent. Our locality guidance approach is therefore simple and efficient, and can serve as a basic performance enhancement method for VTs on tiny datasets. Extensive experiments demonstrate that our method significantly improves VTs trained from scratch on tiny datasets and is compatible with different kinds of VTs and datasets. For example, our proposed method boosts the performance of various VTs on tiny datasets (e.g., 13.07% for DeiT, 8.98% for T2T and 7.85% for PVT), and enhances even the stronger baseline PVTv2 by 1.86% to 79.30%, showing the potential of VTs on tiny datasets. The code is available at https://github.com/lkhl/tiny-transformers.
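The dual-task paradigm described above combines the usual classification objective with a feature-imitation term that pulls the VT's intermediate features toward those of a frozen, pre-trained CNN. A minimal NumPy sketch of such a combined objective is given below; the function name, the MSE imitation term, and the weighting factor `lam` are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def locality_guidance_loss(vt_feats, cnn_feats, logits, labels, lam=1.0):
    """Sketch of a dual-task objective: classification + feature imitation.

    vt_feats:  (N, D) features from the Vision Transformer (trainable).
    cnn_feats: (N, D) features from a frozen, already trained CNN (the guide).
    logits:    (N, C) classification outputs of the VT head.
    labels:    (N,)   integer class labels.
    lam:       weight of the imitation term (assumed hyperparameter).
    """
    # Task 1: standard cross-entropy on the VT's classification head.
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)
    ce = -np.log(probs[np.arange(len(labels)), labels]).mean()

    # Task 2: imitate the frozen CNN's features (locality guidance),
    # here as a simple mean-squared error between feature maps.
    imitation = np.mean((vt_feats - cnn_feats) ** 2)

    return ce + lam * imitation
```

In a real training loop the CNN branch would be kept frozen and only the VT parameters would receive gradients from both terms; setting `lam=0` recovers plain supervised training.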