CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows

We present CSWin Transformer, an efficient and effective Transformer-based backbone for general-purpose vision tasks. A challenging issue in Transformer design is that global self-attention is very expensive to compute, whereas local self-attention often limits the field of interactions of each token. To address this issue, we develop the Cross-Shaped Window self-attention mechanism, which computes self-attention in horizontal and vertical stripes in parallel. Together the stripes form a cross-shaped window, and each stripe is obtained by splitting the input feature into stripes of equal width. We provide a detailed mathematical analysis of the effect of the stripe width and vary the stripe width across the layers of the Transformer network, which achieves strong modeling capability while limiting the computation cost. We also introduce Locally-enhanced Positional Encoding (LePE), which handles local positional information better than existing encoding schemes. LePE naturally supports arbitrary input resolutions, and is thus especially effective and friendly for downstream tasks. Combined with these designs and a hierarchical structure, CSWin Transformer demonstrates competitive performance on common vision tasks. Specifically, it achieves 85.4% Top-1 accuracy on ImageNet-1K without any extra training data or labels, 53.9 box AP and 46.4 mask AP on the COCO detection task, and 51.7 mIoU on the ADE20K semantic segmentation task, surpassing the previous state-of-the-art Swin Transformer backbone by +1.2, +2.0, +1.4, and +2.0 respectively under similar FLOPs settings. By further pretraining on the larger dataset ImageNet-21K, we achieve 87.5% Top-1 accuracy on ImageNet-1K and state-of-the-art segmentation performance on ADE20K with 55.2 mIoU. The code and models will be available at https://github.com/microsoft/CSWin-Transformer.
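
To make the stripe partitioning in the abstract concrete, the following is a minimal pure-Python sketch, not the authors' implementation. It works only on token coordinates of a feature map and omits the attention computation itself. The function names (`stripe_members`, `render_cross`, `keys_per_query`) and the grid sizes are hypothetical, chosen for illustration, and the sketch assumes the stripe width divides both the height and the width of the grid.

```python
def stripe_members(h, w, sw, row, col):
    """Return the positions sharing a horizontal / vertical stripe with (row, col).

    h, w: feature-map height and width in tokens (illustrative values).
    sw:   stripe width; assumed here to divide both h and w.
    """
    if h % sw or w % sw:
        raise ValueError("stripe width must divide both height and width")
    r0 = (row // sw) * sw  # first row of the horizontal stripe containing the token
    c0 = (col // sw) * sw  # first column of the vertical stripe containing the token
    horizontal = {(r, c) for r in range(r0, r0 + sw) for c in range(w)}
    vertical = {(r, c) for r in range(h) for c in range(c0, c0 + sw)}
    return horizontal, vertical


def render_cross(h, w, sw, row, col):
    """Draw the union of both stripes: 'Q' is the query token, '#' is in the cross."""
    horizontal, vertical = stripe_members(h, w, sw, row, col)
    cross = horizontal | vertical
    lines = []
    for r in range(h):
        lines.append("".join(
            "Q" if (r, c) == (row, col) else "#" if (r, c) in cross else "."
            for c in range(w)
        ))
    return "\n".join(lines)


def keys_per_query(h, w, sw):
    """Number of keys one query attends to under each attention pattern."""
    return {
        "horizontal": sw * w,  # one full-width stripe of sw rows
        "vertical": sw * h,    # one full-height stripe of sw columns
        "global": h * w,       # every token in the feature map
    }


if __name__ == "__main__":
    print(render_cross(8, 8, 2, 3, 5))
    print(keys_per_query(8, 8, 2))
```

The picture printed by `render_cross` shows why the union of one horizontal and one vertical stripe is called a cross-shaped window. `keys_per_query` illustrates the trade-off the abstract refers to: a wider stripe lets each token interact with more of the feature map, at a cost that grows with the stripe width, while global attention always pays for all `h * w` keys.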
