Mixture-of-Experts (MoE) is a recently proposed neural network architecture that increases the parameter count of a neural network (the base model) by adding sparsely activated expert blocks, without changing the total number of floating point operations required for training or inference. In theory, this architecture allows us to train arbitrarily large models while keeping the computational cost the same as that of the base model. However, prior work has observed diminishing returns in the test accuracy of MoE models beyond 64 to 128 expert blocks. Thus, training high-quality MoE models requires scaling the size of the base model along with the number of expert blocks. In this work, we propose a novel three-dimensional hybrid parallel algorithm that combines tensor, expert, and data parallelism to enable the training of MoE models with 4-8x larger base models than the current state of the art, DeepSpeed-MoE. We also propose memory optimizations in the optimizer step and communication optimizations that eliminate redundant data movement. Removing these redundancies provides a speedup of nearly 21%. When training a 40 billion parameter MoE model (a 6.7 billion parameter base model with 16 experts) on 128 V100 GPUs, our optimizations improve the achieved fraction of peak half-precision flop/s from 20% to 27%.
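To make the sparse-activation idea concrete, below is a minimal sketch of an MoE layer with top-1 routing, written in PyTorch. The class and parameter names (SimpleMoELayer, hidden_dim, ffn_dim, num_experts) are illustrative assumptions, not the paper's implementation; the point is only that each token runs through a single expert, so per-token FLOPs stay roughly constant while total parameters grow with the number of experts.

```python
# Minimal sketch of a sparsely activated MoE layer with top-1 routing.
# Names and structure are illustrative, not the paper's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleMoELayer(nn.Module):
    def __init__(self, hidden_dim: int, ffn_dim: int, num_experts: int):
        super().__init__()
        # Router produces one score per expert for every token.
        self.router = nn.Linear(hidden_dim, num_experts)
        # Each expert is an independent feed-forward block; only the selected
        # expert runs per token, so FLOPs per token stay constant while the
        # total parameter count grows with num_experts.
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(hidden_dim, ffn_dim),
                nn.GELU(),
                nn.Linear(ffn_dim, hidden_dim),
            )
            for _ in range(num_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, hidden_dim)
        gate_logits = self.router(x)                  # (tokens, experts)
        gate_probs = F.softmax(gate_logits, dim=-1)
        top1_prob, top1_idx = gate_probs.max(dim=-1)  # top-1 routing
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = top1_idx == e
            if mask.any():
                # Scale each token's expert output by its gate probability.
                out[mask] = top1_prob[mask].unsqueeze(-1) * expert(x[mask])
        return out
```

For example, a layer with 16 experts holds 16 independent FFN blocks worth of parameters, yet each token only pays the compute cost of one of them.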
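As a rough illustration of how tensor, expert, and data parallelism can be combined into a three-dimensional decomposition, the sketch below assigns GPU ranks to the three kinds of parallel groups by arranging them in a 3D grid. The axis order, group sizes, and the helper name build_3d_rank_groups are assumptions for illustration; they do not reproduce the paper's actual rank layout.

```python
# A rough sketch of assigning GPU ranks to tensor-, expert-, and
# data-parallel groups via a 3D grid. Axis order and sizes are
# illustrative assumptions, not the paper's actual decomposition.
import numpy as np

def build_3d_rank_groups(world_size: int, tensor_size: int, expert_size: int):
    assert world_size % (tensor_size * expert_size) == 0
    data_size = world_size // (tensor_size * expert_size)
    # Lay out all ranks in a (data, expert, tensor) grid.
    grid = np.arange(world_size).reshape(data_size, expert_size, tensor_size)
    # Tensor-parallel groups: ranks that jointly shard one layer's weights.
    tensor_groups = [grid[d, e, :].tolist()
                     for d in range(data_size) for e in range(expert_size)]
    # Expert-parallel groups: ranks that each hold a different set of experts.
    expert_groups = [grid[d, :, t].tolist()
                     for d in range(data_size) for t in range(tensor_size)]
    # Data-parallel groups: ranks that hold replicas of the same model shard.
    data_groups = [grid[:, e, t].tolist()
                   for e in range(expert_size) for t in range(tensor_size)]
    return tensor_groups, expert_groups, data_groups

# Example: 16 GPUs with tensor_size=2 and expert_size=4 gives data_size=2.
tp, ep, dp = build_3d_rank_groups(16, tensor_size=2, expert_size=4)
```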