Transformers have sprung up in the field of computer vision. In this work, we explore whether the core self-attention module in the Transformer is the key to achieving excellent performance in image recognition. To this end, we build an attention-free network called sMLPNet based on existing MLP-based vision models. Specifically, we replace the MLP module in the token-mixing step with a novel sparse MLP (sMLP) module. For 2D image tokens, sMLP applies 1D MLPs along the axial directions, and the parameters are shared among rows or columns. Through sparse connections and weight sharing, the sMLP module significantly reduces the number of model parameters and the computational complexity, avoiding the common over-fitting problem that plagues the performance of MLP-like models. When trained only on the ImageNet-1K dataset, the proposed sMLPNet achieves 81.9% top-1 accuracy with only 24M parameters, which is much better than most CNNs and vision Transformers under the same model-size constraint. When scaled up to 66M parameters, sMLPNet achieves 83.4% top-1 accuracy, on par with the state-of-the-art Swin Transformer. The success of sMLPNet suggests that the self-attention mechanism is not necessarily a silver bullet in computer vision. The code and models are publicly available at https://github.com/microsoft/SPACH.
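To make the axial token-mixing idea concrete, below is a minimal PyTorch sketch of a sparse MLP block as described above: 1D MLPs applied along the two axial directions of a 2D token map, with weights shared across rows (and across columns). The branch structure, module name, and fusion step are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch of a sparse MLP (sMLP) token-mixing block.
# Assumptions: three parallel branches (mix along width, mix along height,
# identity) fused by a 1x1 convolution; this is an illustrative layout only.
import torch
import torch.nn as nn


class SparseMLP(nn.Module):
    def __init__(self, channels: int, height: int, width: int):
        super().__init__()
        # One linear layer mixes tokens along the width axis; its weights are
        # shared by every row. A second layer does the same along the height axis.
        self.mix_w = nn.Linear(width, width)
        self.mix_h = nn.Linear(height, height)
        # Assumed fusion: concatenate the horizontal, vertical, and identity
        # branches along the channel dimension and project back with a 1x1 conv.
        self.fuse = nn.Conv2d(3 * channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, height, width)
        horizontal = self.mix_w(x)                                 # mix along W for each row
        vertical = self.mix_h(x.transpose(2, 3)).transpose(2, 3)   # mix along H for each column
        return self.fuse(torch.cat([horizontal, vertical, x], dim=1))


# Usage: a 24-channel feature map of 56x56 tokens.
tokens = torch.randn(2, 24, 56, 56)
out = SparseMLP(channels=24, height=56, width=56)(tokens)
print(out.shape)  # torch.Size([2, 24, 56, 56])
```

Compared with a dense token-mixing MLP over all H*W tokens, the two axial layers need only on the order of H^2 + W^2 weights instead of (H*W)^2, which is the source of the parameter and compute savings claimed in the abstract.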