All-MLP architectures have attracted increasing interest as an alternative to attention-based models. In NLP, recent work such as gMLP shows that all-MLPs can match Transformers in language modeling but still lag behind on downstream tasks. In this work, we analyze the limitations of MLPs in expressiveness, and propose sparsely activated MLPs with mixture-of-experts (MoEs) in both the feature and input (token) dimensions. Such sparse all-MLPs significantly increase model capacity and expressiveness while keeping the compute constant. We address critical challenges in incorporating conditional computation through two routing strategies. The proposed sparse all-MLP improves language modeling perplexity and obtains up to a 2$\times$ improvement in training efficiency compared to Transformer-based MoEs (GShard, Switch Transformer, Base Layers, and HASH Layers) as well as to dense Transformers and all-MLPs. Finally, we evaluate its zero-shot in-context learning performance on six downstream tasks and find that it surpasses Transformer-based MoEs and dense Transformers.
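To make the architectural idea concrete, below is a minimal, hypothetical PyTorch sketch of sparse MoE routing applied in both the feature and token dimensions, assuming top-1 (Switch-style) gating. The module names `FeatureMoE` and `TokenMoE`, and all hyperparameters, are illustrative assumptions rather than identifiers from the paper's released implementation; production MoE systems additionally use load-balancing losses and expert-capacity limits, which are omitted here.

```python
# Minimal, hypothetical sketch of sparse MoE routing in both the feature and
# token dimensions, assuming top-1 (Switch-style) gating. Module names and
# hyperparameters are illustrative, not taken from the paper's released code.
import torch
import torch.nn as nn


class FeatureMoE(nn.Module):
    """MoE in the feature dimension: each token is routed to one expert FFN."""

    def __init__(self, d_model, d_hidden, num_experts):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):                           # x: (batch, seq, d_model)
        probs = self.router(x).softmax(dim=-1)      # (batch, seq, num_experts)
        gate, idx = probs.max(dim=-1)               # top-1 expert per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = idx == e                         # tokens assigned to expert e
            if mask.any():
                out[mask] = expert(x[mask])
        return out * gate.unsqueeze(-1)             # scale by gating weight


class TokenMoE(nn.Module):
    """MoE in the token dimension: each chunk of hidden features is routed to
    one expert that mixes information across sequence positions."""

    def __init__(self, d_model, seq_len, num_experts, chunk_size):
        super().__init__()
        assert d_model % chunk_size == 0
        self.chunk_size = chunk_size
        self.router = nn.Linear(chunk_size, num_experts)
        # each expert is a linear map over the token (sequence) axis
        self.experts = nn.ModuleList(
            nn.Linear(seq_len, seq_len) for _ in range(num_experts)
        )

    def forward(self, x):                           # x: (batch, seq, d_model)
        b, s, d = x.shape
        n = d // self.chunk_size
        # (batch, n_chunks, chunk_size, seq): token axis last, so each expert
        # mixes information across positions
        chunks = x.view(b, s, n, self.chunk_size).permute(0, 2, 3, 1)
        probs = self.router(chunks.mean(dim=-1)).softmax(dim=-1)  # (b, n, E)
        gate, idx = probs.max(dim=-1)               # top-1 expert per chunk
        out = torch.zeros_like(chunks)
        for e, expert in enumerate(self.experts):
            mask = idx == e                         # chunks assigned to expert e
            if mask.any():
                out[mask] = expert(chunks[mask])
        out = out * gate.unsqueeze(-1).unsqueeze(-1)
        return out.permute(0, 3, 1, 2).reshape(b, s, d)


if __name__ == "__main__":
    x = torch.randn(2, 16, 64)                      # (batch, seq, d_model)
    block = nn.Sequential(
        FeatureMoE(64, 256, num_experts=4),
        TokenMoE(64, seq_len=16, num_experts=4, chunk_size=16),
    )
    print(block(x).shape)                           # torch.Size([2, 16, 64])
```

Because each token (or feature chunk) activates only one expert, the per-example compute stays roughly constant as experts are added, which is the source of the capacity gains described above.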