Mixture-of-Experts (MoE) is a recently proposed neural network architecture that increases the parameter count of a neural network (the base model) by adding sparsely activated expert blocks, without changing the total number of floating point operations required for training or inference. In theory, this architecture allows us to train arbitrarily large models while keeping the computational cost the same as that of the base model. However, prior work has observed diminishing returns in the test accuracy of MoE models beyond 64 to 128 expert blocks. Thus, training high-quality MoE models requires scaling the size of the base model along with the number of expert blocks. In this work, we propose a novel three-dimensional hybrid parallel algorithm that combines tensor, expert, and data parallelism to enable the training of MoE models with 4-8x larger base models than the current state of the art, DeepSpeed-MoE. We also propose memory optimizations in the optimizer step and communication optimizations that eliminate redundant data movement. Removing these redundancies provides a speedup of nearly 21%. When training a 40 billion parameter MoE model (a 6.7 billion parameter base model with 16 experts) on 128 V100 GPUs, our optimizations improve the achieved fraction of peak half-precision flop/s from 20% to 27%.
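To make the sparse-activation idea concrete, below is a minimal sketch of an MoE layer with top-1 routing, written in PyTorch. The class and parameter names (SimpleMoELayer, hidden_dim, ffn_dim, num_experts) are illustrative assumptions, not the paper's implementation; the point is only that each token runs through a single expert, so per-token FLOPs stay roughly constant while total parameters grow with the number of experts.

```python
# Minimal sketch of a sparsely activated MoE layer with top-1 routing.
# Names and structure are illustrative, not the paper's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleMoELayer(nn.Module):
    def __init__(self, hidden_dim: int, ffn_dim: int, num_experts: int):
        super().__init__()
        # Router produces one score per expert for every token.
        self.router = nn.Linear(hidden_dim, num_experts)
        # Each expert is an independent feed-forward block; only the selected
        # expert runs per token, so FLOPs per token stay constant while the
        # total parameter count grows with num_experts.
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(hidden_dim, ffn_dim),
                nn.GELU(),
                nn.Linear(ffn_dim, hidden_dim),
            )
            for _ in range(num_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, hidden_dim)
        gate_logits = self.router(x)                  # (tokens, experts)
        gate_probs = F.softmax(gate_logits, dim=-1)
        top1_prob, top1_idx = gate_probs.max(dim=-1)  # top-1 routing
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = top1_idx == e
            if mask.any():
                # Scale each token's expert output by its gate probability.
                out[mask] = top1_prob[mask].unsqueeze(-1) * expert(x[mask])
        return out
```

For example, a layer with 16 experts holds 16 independent FFN blocks worth of parameters, yet each token only pays the compute cost of one of them.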
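As a rough illustration of how tensor, expert, and data parallelism can be combined into a three-dimensional decomposition, the sketch below assigns GPU ranks to the three kinds of parallel groups by arranging them in a 3D grid. The axis order, group sizes, and the helper name build_3d_rank_groups are assumptions for illustration; they do not reproduce the paper's actual rank layout.

```python
# A rough sketch of assigning GPU ranks to tensor-, expert-, and
# data-parallel groups via a 3D grid. Axis order and sizes are
# illustrative assumptions, not the paper's actual decomposition.
import numpy as np

def build_3d_rank_groups(world_size: int, tensor_size: int, expert_size: int):
    assert world_size % (tensor_size * expert_size) == 0
    data_size = world_size // (tensor_size * expert_size)
    # Lay out all ranks in a (data, expert, tensor) grid.
    grid = np.arange(world_size).reshape(data_size, expert_size, tensor_size)
    # Tensor-parallel groups: ranks that jointly shard one layer's weights.
    tensor_groups = [grid[d, e, :].tolist()
                     for d in range(data_size) for e in range(expert_size)]
    # Expert-parallel groups: ranks that each hold a different set of experts.
    expert_groups = [grid[d, :, t].tolist()
                     for d in range(data_size) for t in range(tensor_size)]
    # Data-parallel groups: ranks that hold replicas of the same model shard.
    data_groups = [grid[:, e, t].tolist()
                   for e in range(expert_size) for t in range(tensor_size)]
    return tensor_groups, expert_groups, data_groups

# Example: 16 GPUs with tensor_size=2 and expert_size=4 gives data_size=2.
tp, ep, dp = build_3d_rank_groups(16, tensor_size=2, expert_size=4)
```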