In this paper, we present Vision Permutator, a conceptually simple and data-efficient MLP-like architecture for visual recognition. Recognizing the importance of the positional information carried by 2D feature representations, Vision Permutator, unlike recent MLP-like models that encode spatial information along the flattened spatial dimensions, separately encodes the feature representations along the height and width dimensions with linear projections. This allows Vision Permutator to capture long-range dependencies along one spatial direction while preserving precise positional information along the other. The resulting position-sensitive outputs are then aggregated in a mutually complementing manner to form expressive representations of the objects of interest. We show that our Vision Permutators are formidable competitors to convolutional neural networks (CNNs) and vision transformers. Without relying on spatial convolutions or attention mechanisms, Vision Permutator achieves 81.5% top-1 accuracy on ImageNet without extra large-scale training data (e.g., ImageNet-22k) using only 25M learnable parameters, which is much better than most CNNs and vision transformers under the same model size constraint. When scaled up to 88M parameters, it attains 83.2% top-1 accuracy. We hope this work encourages further research on rethinking how spatial information is encoded and facilitates the development of MLP-like models. Code is available at https://github.com/Andrew-Qibin/VisionPermutator.
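To make the core idea concrete, below is a minimal PyTorch sketch of encoding feature representations separately along the height and width dimensions with linear projections and then aggregating the position-sensitive outputs. It is an illustration under simplifying assumptions (the class and parameter names such as `SimplifiedPermuteMLP`, `proj_h`, and `proj_w` are hypothetical, and a plain sum replaces the paper's weighted aggregation); it is not the authors' exact Permute-MLP implementation, which can be found in the linked repository.

```python
# A minimal sketch of the idea described above: mix information separately
# along the height and width axes with linear projections, then fuse the
# position-sensitive outputs with a per-position channel projection.
# NOTE: simplified illustration, not the official Vision Permutator code.
import torch
import torch.nn as nn

class SimplifiedPermuteMLP(nn.Module):
    def __init__(self, dim, height, width):
        super().__init__()
        self.proj_h = nn.Linear(height, height)  # mixes tokens along the height axis
        self.proj_w = nn.Linear(width, width)    # mixes tokens along the width axis
        self.proj_c = nn.Linear(dim, dim)        # mixes channels at each position
        self.fuse = nn.Linear(dim, dim)          # aggregates the three branches

    def forward(self, x):
        # x: (B, H, W, C)
        # Height branch: move H to the last axis, project, move it back.
        h = self.proj_h(x.permute(0, 3, 2, 1)).permute(0, 3, 2, 1)
        # Width branch: move W to the last axis, project, move it back.
        w = self.proj_w(x.permute(0, 1, 3, 2)).permute(0, 1, 3, 2)
        # Channel branch: plain per-position channel mixing.
        c = self.proj_c(x)
        # Sum the complementary branches and fuse (the paper uses a weighted
        # aggregation; a plain sum keeps this sketch minimal).
        return self.fuse(h + w + c)

# Usage: a batch of 14x14 grids of 384-dim tokens (shapes chosen for illustration).
x = torch.randn(2, 14, 14, 384)
out = SimplifiedPermuteMLP(384, 14, 14)(x)
print(out.shape)  # torch.Size([2, 14, 14, 384])
```

Each branch applies a linear projection whose weights are shared across the orthogonal spatial axis, which is what lets the model capture long-range dependencies along one direction while keeping precise positional information along the other.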