Token-mixing multi-layer perceptron (MLP) models have shown competitive performance in computer vision tasks with a simple architecture and relatively small computational cost. Their success in maintaining computational efficiency is mainly attributed to avoiding self-attention, which is often computationally heavy, but this comes at the expense of being unable to mix tokens both globally and locally. In this paper, to exploit both global and local dependencies without self-attention, we present Mix-Shift-MLP (MS-MLP), in which the size of the local receptive field used for mixing grows with the amount of spatial shifting. In addition to conventional mixing and shifting techniques, MS-MLP mixes both neighboring and distant tokens, from fine- to coarse-grained levels, and then gathers them via a shifting operation. This directly contributes to the interactions between global and local tokens. Being simple to implement, MS-MLP achieves competitive performance on multiple vision benchmarks. For example, an MS-MLP with 85 million parameters achieves 83.8% top-1 classification accuracy on ImageNet-1K. Moreover, by combining MS-MLP with state-of-the-art Vision Transformers such as the Swin Transformer, we show that MS-MLP achieves further improvements at three different model scales, e.g., by 0.5% on ImageNet-1K classification with Swin-B. The code is available at: https://github.com/JegZheng/MS-MLP.
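To make the shift-based token mixing concrete, the sketch below illustrates the general idea of mixing spatial information by shifting channel groups of a feature map by different offsets, so that a subsequent channel MLP sees tokens from multiple spatial positions. The specific offsets, group sizes, and helper name `mix_shift` are illustrative assumptions, not the paper's exact mixing schedule; see the linked repository for the authors' implementation.

```python
import numpy as np

def mix_shift(x, shifts=((0, 0), (0, 1), (0, -1), (1, 0), (-1, 0))):
    """Shift equal channel groups of an (H, W, C) feature map by different
    spatial offsets. After this operation, each position's channel vector
    contains features gathered from several spatial neighbors, so a plain
    per-position (channel) MLP can mix tokens across space.

    NOTE: the offsets here are a simple 4-neighborhood example; MS-MLP uses
    a fine-to-coarse schedule where larger shifts pair with larger mixing
    receptive fields.
    """
    H, W, C = x.shape
    assert C % len(shifts) == 0, "channels must split evenly across shifts"
    groups = np.split(x, len(shifts), axis=-1)
    shifted = [np.roll(g, shift=s, axis=(0, 1)) for g, s in zip(groups, shifts)]
    return np.concatenate(shifted, axis=-1)
```

A usage example: for a 4x4 map with 5 channels and the 5 offsets above, each channel group is rolled by its own offset, and the zero-shift group passes through unchanged.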