The remarkable success of Transformers in natural language processing has motivated researchers in the computer vision community to build Vision Transformers. Compared with Convolutional Neural Networks (CNNs), a Vision Transformer has a larger receptive field and can therefore capture long-range dependencies. However, this large receptive field comes at a high computational cost. To improve efficiency, window-based Vision Transformers have emerged. They partition an image into several local windows and compute self-attention within each window. To restore the global receptive field, window-based Vision Transformers have devoted considerable effort to cross-window communication through a number of sophisticated operations. In this work, we examine the necessity of the key design element of Swin Transformer, shifted window partitioning. We find that a simple depthwise convolution is sufficient for effective cross-window communication. Specifically, when the depthwise convolution is present, the shifted window configuration in Swin Transformer yields no additional performance improvement. We therefore reduce Swin Transformer to a plain Window-based (Win) Transformer by discarding the shifted window partitioning. The proposed Win Transformer is conceptually simpler and easier to implement than Swin Transformer. Moreover, Win Transformer consistently outperforms Swin Transformer on multiple computer vision tasks, including image recognition, semantic segmentation, and object detection.
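The cross-window argument can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the function names are hypothetical, uniform averaging inside each window stands in for learned self-attention, and the all-ones 3x3 kernel is an arbitrary choice. It only shows that mixing confined to fixed (non-shifted) windows leaves other windows untouched, while a per-channel 3x3 convolution carries information across window borders.

```python
def window_partition(feature, window):
    """Split an H x W grid (list of rows) into non-overlapping window x window tiles."""
    height, width = len(feature), len(feature[0])
    assert height % window == 0 and width % window == 0, "grid must tile evenly"
    tiles = []
    for top in range(0, height, window):
        for left in range(0, width, window):
            tiles.append([row[left:left + window] for row in feature[top:top + window]])
    return tiles


def window_mean_mixing(feature, window):
    """Assumed stand-in for window self-attention: uniform attention weights,
    so every position receives the mean of its own window and nothing else."""
    height, width = len(feature), len(feature[0])
    out = [[0.0] * width for _ in range(height)]
    for top in range(0, height, window):
        for left in range(0, width, window):
            cells = [(r, c) for r in range(top, top + window)
                     for c in range(left, left + window)]
            mean = sum(feature[r][c] for r, c in cells) / len(cells)
            for r, c in cells:
                out[r][c] = mean
    return out


def depthwise_conv3x3(channels, kernels):
    """Apply one 3x3 kernel per channel (no channel mixing), with zero padding."""
    outputs = []
    for grid, kernel in zip(channels, kernels):
        height, width = len(grid), len(grid[0])
        result = [[0.0] * width for _ in range(height)]
        for r in range(height):
            for c in range(width):
                total = 0.0
                for dr in (-1, 0, 1):
                    for dc in (-1, 0, 1):
                        rr, cc = r + dr, c + dc
                        if 0 <= rr < height and 0 <= cc < width:
                            total += kernel[dr + 1][dc + 1] * grid[rr][cc]
                result[r][c] = total
        outputs.append(result)
    return outputs


# One channel, 4 x 4 grid, a single non-zero value inside the top-left 2 x 2 window.
feature = [[0.0] * 4 for _ in range(4)]
feature[1][1] = 4.0

mixed = window_mean_mixing(feature, window=2)          # stays inside the top-left window
ones_kernel = [[1.0] * 3 for _ in range(3)]            # arbitrary illustrative kernel
spread = depthwise_conv3x3([mixed], [ones_kernel])[0]  # leaks into neighbouring windows

print(mixed[0][2], spread[0][2])
```

In this toy setting, position (0, 2) belongs to the top-right window and receives nothing from window-local mixing, but it becomes non-zero once the depthwise convolution is applied. In a framework such as PyTorch, a depthwise convolution is commonly expressed as a 2D convolution whose group count equals its channel count.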