Vision multi-layer perceptrons (MLPs) have shown promising performance in computer vision tasks and have become a main competitor of CNNs and vision Transformers. They use token-mixing layers to capture cross-token interactions, in place of the multi-head self-attention mechanism used by Transformers. However, the heavily parameterized token-mixing layers naturally lack mechanisms to capture local information and multi-granular non-local relations, so their discriminative power is restrained. To tackle this issue, we propose a new positional spatial gating unit (PoSGU). It exploits the attention formulations used in classical relative positional encoding (RPE) to efficiently encode cross-token relations for token mixing, and it successfully reduces the quadratic parameter complexity $O(N^2)$ of vision MLPs to $O(N)$ and $O(1)$. We experiment with two RPE mechanisms and further propose a group-wise extension that improves their expressive power by capturing multi-granular contexts. These units then serve as the key building blocks of a new type of vision MLP, referred to as PosMLP. We evaluate the effectiveness of the proposed approach through thorough experiments, demonstrating improved or comparable performance with reduced parameter complexity. For instance, for a model trained on ImageNet1K, we achieve a performance improvement from 72.14\% to 74.02\% and a learnable-parameter reduction from $19.4M$ to $18.2M$. Code can be found at \href{https://github.com/Zhicaiwww/PosMLP}{https://github.com/Zhicaiwww/PosMLP}.
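The parameter reduction from $O(N^2)$ to $O(N)$ can be illustrated with a minimal NumPy sketch: instead of a dense $N \times N$ token-mixing weight, the mixing matrix is built from a learnable relative-position bias vector of length $2N-1$, indexed by token offset (a Toeplitz structure), and used inside a spatial gating unit. This is an assumption-laden simplification of the paper's PoSGU (the function names `relative_position_matrix` and `posgu` are ours), not the reference implementation.

```python
import numpy as np

def relative_position_matrix(bias, n):
    """Build an (n, n) token-mixing matrix from a (2n-1,) relative-position
    bias vector: entry (i, j) depends only on the offset j - i, so the
    mixing weights cost O(N) parameters rather than O(N^2)."""
    idx = np.arange(n)
    rel = idx[None, :] - idx[:, None] + n - 1  # offsets shifted into [0, 2n-2]
    return bias[rel]

def posgu(x, bias):
    """Toy positional spatial gating unit: split channels in half, mix one
    half across tokens with the relative-position matrix, and use the
    result to gate the other half element-wise."""
    n, c = x.shape
    u, v = x[:, : c // 2], x[:, c // 2 :]
    mix = relative_position_matrix(bias, n)
    return u * (mix @ v)

rng = np.random.default_rng(0)
n, d = 4, 3
x = rng.standard_normal((n, 2 * d))      # n tokens, 2d channels
bias = rng.standard_normal(2 * n - 1)    # O(N) learnable parameters
y = posgu(x, bias)
print(y.shape)  # (4, 3)
```

Sharing one bias value per offset is what collapses the parameter count; a group-wise extension, as proposed in the paper, would maintain one such bias vector per channel group to recover multi-granular expressiveness.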