Mixture-of-Experts (MoE) with sparse conditional computation has proven to be an effective architecture for scaling attention-based models to more parameters at a comparable computational cost. In this paper, we propose Sparse-MLP, which scales the recent MLP-Mixer model with sparse MoE layers to achieve a more computation-efficient architecture. We replace a subset of the dense MLP blocks in MLP-Mixer with Sparse blocks. In each Sparse block, we apply two stages of MoE layers: one with MLP experts that mix information within channels along the image-patch dimension, and one with MLP experts that mix information within patches along the channel dimension. In addition, to reduce the computational cost of routing and to improve expert capacity, we design Re-represent layers in each Sparse block. These layers re-scale image representations with two simple but effective linear transformations. When pre-trained on ImageNet-1k with the MoCo v3 algorithm, our models outperform dense MLP models by 2.5\% in ImageNet Top-1 accuracy with fewer parameters and lower computational cost. On small-scale downstream image classification tasks, i.e., CIFAR-10 and CIFAR-100, our Sparse-MLP still achieves better performance than the baselines.
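To make the Sparse block concrete, the following is a minimal PyTorch-style sketch of the two-stage MoE structure and the Re-represent layers described above. All module and argument names (MLPExpert, MoE, SparseBlock, num_experts, s_patches, s_channels) are illustrative assumptions, not the authors' released implementation; routing uses a simple top-1 gate and omits capacity and load-balancing details.

# Sketch only: names and hyper-parameters are assumptions, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MLPExpert(nn.Module):
    """Two-layer MLP expert applied to the last dimension."""
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.fc1 = nn.Linear(dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, dim)

    def forward(self, x):
        return self.fc2(F.gelu(self.fc1(x)))


class MoE(nn.Module):
    """Top-1 routed mixture of MLP experts (capacity handling omitted)."""
    def __init__(self, dim, hidden_dim, num_experts=4):
        super().__init__()
        self.gate = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(
            MLPExpert(dim, hidden_dim) for _ in range(num_experts))

    def forward(self, x):                      # x: (tokens, dim)
        scores = F.softmax(self.gate(x), dim=-1)
        top1 = scores.argmax(dim=-1)           # chosen expert per token
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = top1 == i
            if mask.any():
                # weight each expert output by its gate probability
                out[mask] = expert(x[mask]) * scores[mask, i:i + 1]
        return out


class SparseBlock(nn.Module):
    """Two MoE stages plus Re-represent (linear re-scaling) layers."""
    def __init__(self, patches, channels, s_patches, s_channels):
        super().__init__()
        # Re-represent layers: simple linear maps that shrink/expand the
        # channel dimension around patch-wise routing to cut routing cost.
        self.shrink = nn.Linear(channels, s_channels)
        self.expand = nn.Linear(s_channels, channels)
        # Stage 1: experts mix information along the patch dimension.
        self.moe_patches = MoE(patches, s_patches)
        # Stage 2: experts mix information along the channel dimension.
        self.moe_channels = MoE(channels, s_channels * 4)
        self.norm1 = nn.LayerNorm(channels)
        self.norm2 = nn.LayerNorm(channels)

    def forward(self, x):                      # x: (batch, patches, channels)
        b, p, c = x.shape
        y = self.shrink(self.norm1(x))         # (b, p, s_channels)
        y = y.transpose(1, 2).reshape(-1, p)   # route tokens of length p
        y = self.moe_patches(y)
        y = y.reshape(b, -1, p).transpose(1, 2)
        x = x + self.expand(y)                 # back to (b, p, c)
        y = self.norm2(x).reshape(-1, c)       # route tokens of length c
        x = x + self.moe_channels(y).reshape(b, p, c)
        return x


# Example usage with assumed ViT-style dimensions:
block = SparseBlock(patches=196, channels=768, s_patches=96, s_channels=384)
out = block(torch.randn(2, 196, 768))          # -> (2, 196, 768)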