This paper presents a new vision Transformer, called Swin Transformer, that capably serves as a general-purpose backbone for computer vision. Challenges in adapting Transformer from language to vision arise from differences between the two domains, such as large variations in the scale of visual entities and the high resolution of pixels in images compared to words in text. To address these differences, we propose a hierarchical Transformer whose representation is computed with \textbf{S}hifted \textbf{win}dows. The shifted windowing scheme brings greater efficiency by limiting self-attention computation to non-overlapping local windows while also allowing for cross-window connection. This hierarchical architecture has the flexibility to model at various scales and has linear computational complexity with respect to image size. These qualities of Swin Transformer make it compatible with a broad range of vision tasks, including image classification (87.3 top-1 accuracy on ImageNet-1K) and dense prediction tasks such as object detection (58.7 box AP and 51.1 mask AP on COCO test-dev) and semantic segmentation (53.5 mIoU on ADE20K val). Its performance surpasses the previous state-of-the-art by a large margin of +2.7 box AP and +2.6 mask AP on COCO, and +3.2 mIoU on ADE20K, demonstrating the potential of Transformer-based models as vision backbones. The hierarchical design and the shifted window approach also prove beneficial for all-MLP architectures. The code and models are publicly available at~\url{https://github.com/microsoft/Swin-Transformer}.
翻译:本文展示了一个新的视觉变换器,称为 Swin 变换器。 变换窗口方案通过将自控计算限制在不重叠的地方窗口, 同时也允许跨窗口连接, 使变换器从语言到视觉的挑战来自两个领域之间的差异, 例如视觉实体规模的大幅变化, 图像像素与文字文字文字中的像素的高度分辨率。 为了解决这些差异, 我们提议了一个等级变换器, 其表达方式以\ textbf{ S}hifted\ textbf{ win}dows 来计算。 变换窗口方案通过将自控计算限制在不重叠的地方窗口中, 同时也允许跨窗口连接。 变换变变转换器的挑战来自两个领域的差异, 比如视觉实体规模的大小和图像像素的高度分辨率。 Swin 变换器的这些特性使得它与广泛的视觉任务相容, 包括图像分类( 图像网络-1 的精度) 和密集的预测任务, 例如天体模型(58.7 框 AP20 和510 高级的 ASloveyal AS- ASloveal ASloveal- del- del- del- ASloveal sural sural.