Although vision Transformers have achieved excellent performance as backbone models in many vision tasks, most of them attempt to capture the global relations of all tokens in an image or a window, which disrupts the inherent spatial and local correlations between patches in the 2D structure. In this paper, we introduce a simple vision Transformer, named SimViT, to incorporate spatial structure and local information into vision Transformers. Specifically, we introduce Multi-head Central Self-Attention (MCSA) instead of conventional Multi-head Self-Attention to capture highly local relations. The introduction of sliding windows facilitates the capture of spatial structure. Meanwhile, SimViT extracts multi-scale hierarchical features from different layers for dense prediction tasks. Extensive experiments show that SimViT is effective and efficient as a general-purpose backbone model for various image processing tasks. In particular, our SimViT-Micro needs only 3.3M parameters to achieve 71.1% top-1 accuracy on the ImageNet-1k dataset, making it the smallest vision Transformer model to date. Our code will be available at https://github.com/ucasligang/SimViT.
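For intuition, the following is a minimal PyTorch-style sketch of multi-head self-attention in which each query token attends only to keys and values inside a sliding k x k window centred on it, which is the general idea behind MCSA as described above. This is a sketch under stated assumptions, not the paper's exact implementation: the class name CentralAttention, the window_size parameter, and the zero-padding at image borders are illustrative choices.

```python
# A minimal sketch of window-restricted ("central") self-attention, assuming a
# PyTorch implementation where each query attends only to its k x k neighbourhood.
# Names (CentralAttention, window_size) are illustrative, not from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CentralAttention(nn.Module):
    def __init__(self, dim, num_heads=4, window_size=3):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.window_size = window_size
        self.scale = self.head_dim ** -0.5
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, dim * 2)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):
        # x: (B, H, W, C) feature map of patch tokens
        B, H, W, C = x.shape
        k = self.window_size
        pad = k // 2  # zero-padding at borders (a simplification in this sketch)

        q = self.q(x).view(B, H * W, self.num_heads, self.head_dim)   # (B, N, h, d)
        kv = self.kv(x).permute(0, 3, 1, 2)                           # (B, 2C, H, W)
        # Gather the k*k neighbourhood of every position with unfold.
        kv = F.unfold(kv, kernel_size=k, padding=pad)                 # (B, 2C*k*k, N)
        kv = kv.view(B, 2, self.num_heads, self.head_dim, k * k, H * W)
        key, val = kv[:, 0], kv[:, 1]                                 # (B, h, d, k*k, N)

        # Each central query attends over its local window only.
        q = q.permute(0, 2, 1, 3)                                     # (B, h, N, d)
        attn = torch.einsum('bhnd,bhdkn->bhnk', q, key) * self.scale  # (B, h, N, k*k)
        attn = attn.softmax(dim=-1)
        out = torch.einsum('bhnk,bhdkn->bhnd', attn, val)             # (B, h, N, d)
        out = out.permute(0, 2, 1, 3).reshape(B, H, W, C)
        return self.proj(out)


# Example usage on a 14x14 grid of 64-dim patch tokens:
# y = CentralAttention(dim=64)(torch.randn(2, 14, 14, 64))  # -> (2, 14, 14, 64)
```

Because attention is computed over a fixed-size window rather than all tokens, the cost grows linearly with the number of patches, which is what makes this kind of local attention attractive for hierarchical, multi-scale backbones.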