Vision Transformers have attracted considerable attention since the successful application of the Vision Transformer (ViT) to vision tasks. With vision Transformers, specifically the multi-head self-attention modules, networks can inherently capture long-range dependencies. However, these attention modules normally need to be trained on large datasets, and vision Transformers show inferior performance when trained from scratch on small datasets compared with widely dominant backbones such as ResNets. Note that the Transformer model was originally proposed for natural language processing, which carries denser information than natural images. To boost the performance of vision Transformers on small datasets, this paper proposes to explicitly increase the input information density in the frequency domain. Specifically, we select channels by computing channel-wise heatmaps in the frequency domain using the Discrete Cosine Transform (DCT), reducing the input size while keeping most of the information and hence increasing the information density. As a result, 25% fewer channels are kept while better performance is achieved compared with previous work. Extensive experiments demonstrate the effectiveness of the proposed approach on five small-scale datasets: CIFAR-10/100, SVHN, Flowers-102, and Tiny ImageNet. Accuracy is boosted by up to 17.05% with the Swin and Focal Transformers. Code is available at https://github.com/xiangyu8/DenseVT.
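The released code (linked above) contains the full pipeline; as a rough illustration of the general idea only, a minimal sketch of block-wise DCT followed by channel selection might look as follows. The function name `dct_channel_select`, the block size, and the energy-based heatmap criterion are assumptions for illustration and are not claimed to reproduce the paper's exact selection rule.

```python
import numpy as np
from scipy.fft import dctn


def dct_channel_select(image, block=8, keep=48):
    """Minimal sketch (not the paper's exact method): block-wise 2D DCT,
    regroup coefficients into frequency channels, then keep the `keep`
    channels with the highest average energy.

    image: (H, W, C) array; H and W must be multiples of `block`.
    Returns an array of shape (H // block, W // block, keep).
    """
    H, W, C = image.shape
    # Split into non-overlapping block x block tiles and apply a 2D DCT per tile.
    tiles = image.reshape(H // block, block, W // block, block, C)
    coeffs = dctn(tiles, axes=(1, 3), norm="ortho")
    # Regroup: each of the block*block*C frequency positions becomes one channel.
    freq = coeffs.transpose(0, 2, 1, 3, 4).reshape(
        H // block, W // block, block * block * C
    )
    # Channel-wise "heatmap" (assumed here): mean absolute magnitude per channel.
    heat = np.abs(freq).mean(axis=(0, 1))
    top = np.argsort(heat)[::-1][:keep]  # indices of the most energetic channels
    return freq[..., top]


# Usage: a random 224x224 RGB image reduced to a 28x28 map with 48 channels,
# which could then be fed to a vision Transformer backbone.
x = np.random.rand(224, 224, 3).astype(np.float32)
y = dct_channel_select(x, block=8, keep=48)
print(y.shape)  # (28, 28, 48)
```

Keeping only a subset of frequency channels shrinks the spatial input handed to the Transformer while discarding mostly low-energy components, which is one simple way to realize the "fewer channels, higher information density" idea described in the abstract.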