Recently, the Vision Transformer (ViT), which applies the transformer architecture to image classification, has outperformed convolutional neural networks. However, the high performance of ViT stems from pre-training on large-scale datasets such as JFT-300M, and this dependence on large data is attributed to its low locality inductive bias. This paper proposes Shifted Patch Tokenization (SPT) and Locality Self-Attention (LSA), which effectively address the lack of locality inductive bias and enable ViTs to be trained from scratch even on small-size datasets. Moreover, SPT and LSA are generic and effective add-on modules that are readily applicable to various ViTs. Experimental results show that applying both SPT and LSA to ViTs improved performance by an average of 2.96% on Tiny-ImageNet, a representative small-size dataset. In particular, the Swin Transformer achieved a substantial performance improvement of 4.08% with the proposed SPT and LSA.
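To make the two add-on modules concrete, below is a minimal PyTorch sketch of what SPT and LSA could look like. The abstract does not specify the mechanisms, so the details here (half-patch diagonal shifts concatenated to the input in SPT; a learnable softmax temperature and diagonal masking in LSA), as well as all class, function, and parameter names (ShiftedPatchTokenization, LocalitySelfAttention, diagonal_shift, patch_size, embed_dim), are illustrative assumptions rather than the paper's exact implementation.

```python
# Hypothetical sketch of the two add-on modules named in the abstract (SPT and LSA).
# Mechanism details are assumptions for illustration, not taken from the abstract.
import torch
import torch.nn as nn
import torch.nn.functional as F


def diagonal_shift(x, dy, dx):
    """Zero-padding spatial shift of an image batch (B, C, H, W) by (dy, dx)."""
    _, _, h, w = x.shape
    pad_l, pad_r = max(dx, 0), max(-dx, 0)
    pad_t, pad_b = max(dy, 0), max(-dy, 0)
    x = F.pad(x, (pad_l, pad_r, pad_t, pad_b))
    return x[:, :, pad_b:pad_b + h, pad_l:pad_l + w]


class ShiftedPatchTokenization(nn.Module):
    """SPT sketch: concatenate the image with four diagonally shifted copies,
    then split into non-overlapping patches and linearly project them."""

    def __init__(self, patch_size=8, in_chans=3, embed_dim=192):
        super().__init__()
        self.patch_size = patch_size
        patch_dim = in_chans * 5 * patch_size * patch_size  # original + 4 shifted copies
        self.norm = nn.LayerNorm(patch_dim)
        self.proj = nn.Linear(patch_dim, embed_dim)

    def forward(self, x):
        b, c, h, w = x.shape
        s = self.patch_size // 2  # assumed shift of half a patch
        shifted = [diagonal_shift(x, dy, dx) for dy in (-s, s) for dx in (-s, s)]
        x = torch.cat([x] + shifted, dim=1)                        # (B, 5C, H, W)
        p = self.patch_size
        x = x.unfold(2, p, p).unfold(3, p, p)                      # (B, 5C, H/P, W/P, P, P)
        x = x.permute(0, 2, 3, 1, 4, 5).flatten(3).flatten(1, 2)   # (B, N, 5C*P*P)
        return self.proj(self.norm(x))


class LocalitySelfAttention(nn.Module):
    """LSA sketch: self-attention with a learnable softmax temperature and
    masking of each token's similarity to itself (the diagonal)."""

    def __init__(self, dim, num_heads=3):
        super().__init__()
        self.num_heads = num_heads
        self.qkv = nn.Linear(dim, dim * 3, bias=False)
        self.proj = nn.Linear(dim, dim)
        # Learnable temperature, initialised to the usual 1/sqrt(d_head) scale.
        self.temperature = nn.Parameter(torch.tensor((dim // num_heads) ** -0.5))

    def forward(self, x):
        b, n, d = x.shape
        qkv = self.qkv(x).reshape(b, n, 3, self.num_heads, -1).permute(2, 0, 3, 1, 4)
        q, k, v = qkv[0], qkv[1], qkv[2]
        attn = (q @ k.transpose(-2, -1)) * self.temperature       # (B, H, N, N)
        mask = torch.eye(n, dtype=torch.bool, device=x.device)
        attn = attn.masked_fill(mask, float('-inf'))              # drop self-token relations
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, n, d)
        return self.proj(out)


# Shape check on a Tiny-ImageNet-sized input (64x64; patch size 8 gives 64 tokens).
tokens = ShiftedPatchTokenization()(torch.randn(2, 3, 64, 64))    # (2, 64, 192)
out = LocalitySelfAttention(dim=192)(tokens)                      # (2, 64, 192)
```

Both modules keep the standard ViT interface (images in, token embeddings out; tokens in, tokens out), which is consistent with the abstract's claim that they are generic add-ons applicable to various ViTs.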