何以造就等级式的愿景变异器? (What Makes for Hierarchical Vision Transformer?)

Recent studies indicate that hierarchical Vision Transformer with a macro architecture of interleaved non-overlapped window-based self-attention \& shifted-window operation is able to achieve state-of-the-art performance in various visual recognition tasks, and challenges the ubiquitous convolutional neural networks (CNNs) using densely slid kernels. Most follow-up works attempt to replace the shifted-window operation with other kinds of cross-window communication paradigms, while treating self-attention as the de-facto standard for window-based information aggregation. In this manuscript, we question whether self-attention is the only choice for hierarchical Vision Transformer to attain strong performance, and the effects of different kinds of cross-window communication. To this end, we replace self-attention layers with embarrassingly simple linear mapping layers, and the resulting proof-of-concept architecture termed as LinMapper can achieve very strong performance in ImageNet-1k image recognition. Moreover, we find that LinMapper is able to better leverage the pre-trained representations from image recognition and demonstrates excellent transfer learning properties on downstream dense prediction tasks such as object detection and instance segmentation. We also experiment with other alternatives to self-attention for content aggregation inside each non-overlapped window under different cross-window communication approaches, which all give similar competitive results. Our study reveals that the \textbf{macro architecture} of Swin model families, other than specific aggregation layers or specific means of cross-window communication, may be more responsible for its strong performance and is the real challenger to the ubiquitous CNN's dense sliding window paradigm. Code and models will be publicly available to facilitate future research.

翻译：最近的研究表明, 等级式的视野变换器, 其宏观结构是互换式的、不受过度限制的窗口自我关注的宏观结构。在这个手稿中, 我们怀疑自控是否是唯一等级式的视野变换器在各种视觉识别任务中达到最先进的性能, 并挑战无处不在的共振神经网络(CNNs ), 使用高密度的细滑内核内核内核。大多数后续工程都试图用其他类型的跨窗口通信模式来取代上风操作, 同时将自留式视为基于窗口的信息集合的跨式标准。在这个手稿中, 我们怀疑自留式自留式是否是唯一的级变型变型的, 以及不同类型的跨窗口交流的效果。我们的自留式演算法, 也可以在每部内自置式的自置式演算工具中。