Vision multi-layer perceptrons (MLPs) have shown promising performance in computer vision tasks, and have become a main competitor of CNNs and vision Transformers. They use token-mixing layers to capture cross-token interactions, in place of the multi-head self-attention mechanism used by Transformers. However, the heavily parameterized token-mixing layers naturally lack mechanisms to capture local information and multi-granular non-local relations, which restrains their discriminative power. To tackle this issue, we propose a new positional spatial gating unit (PoSGU). It exploits the attention formulations used in classical relative positional encoding (RPE) to efficiently encode the cross-token relations for token mixing, reducing the quadratic parameter complexity $O(N^2)$ of current vision MLPs to $O(N)$ or even $O(1)$. We experiment with two RPE mechanisms, and further propose a group-wise extension that improves their expressive power by incorporating multi-granular contexts. These units then serve as the key building blocks of a new type of vision MLP, referred to as PosMLP. We evaluate the effectiveness of the proposed approach through thorough experiments, demonstrating improved or comparable performance with reduced parameter complexity. For instance, for a model trained on ImageNet1K, we achieve a performance improvement from 72.14\% to 74.02\% and a reduction in learnable parameters from $19.4$M to $18.2$M. Code is available at https://github.com/Zhicaiwww/PosMLP.
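To illustrate the core idea, the following is a minimal sketch (in PyTorch, which the linked repository uses) of a positional spatial gating unit for a 1-D token sequence of fixed length $N$: the dense $N \times N$ token-mixing weight matrix of a standard spatial gating unit is replaced by a table of $2N-1$ relative-position biases, giving $O(N)$ instead of $O(N^2)$ parameters. All names (e.g., `PoSGUSketch`) and shapes are hypothetical simplifications, not the authors' implementation, which additionally covers 2-D windows, the $O(1)$ variant, and the group-wise extension.

```python
import torch
import torch.nn as nn


class PoSGUSketch(nn.Module):
    """Sketch of a positional spatial gating unit for a 1-D sequence.

    The N x N token-mixing matrix is built from a relative-position
    bias table with only 2N - 1 learnable scalars: O(N) parameters
    instead of the O(N^2) of a dense token-mixing layer.
    """

    def __init__(self, dim: int, num_tokens: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim // 2)
        # One learnable scalar per relative offset in [-(N-1), N-1].
        self.rel_bias = nn.Parameter(torch.zeros(2 * num_tokens - 1))
        # Index map: entry (i, j) selects the bias for offset j - i.
        idx = torch.arange(num_tokens)
        self.register_buffer("rel_idx", idx[None, :] - idx[:, None] + num_tokens - 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, N, dim); split channels as in a gating unit.
        u, v = x.chunk(2, dim=-1)
        v = self.norm(v)
        # Expand the O(N) bias table into the N x N mixing matrix.
        mix = self.rel_bias[self.rel_idx]        # (N, N)
        v = torch.einsum("mn,bnd->bmd", mix, v)  # mix tokens
        return u * v                             # gated output
```

Because every entry of the mixing matrix is tied to a relative offset rather than an absolute token pair, the parameter count is independent of how the $N \times N$ matrix is laid out; the $O(1)$ variant mentioned above can be obtained by generating the biases from a small shared function of the relative position instead of a per-offset table.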