In recent years, Mixture-of-Experts (MoE) has emerged as a promising technique for deep learning that can scale model capacity to trillion-plus parameters while reducing computing cost via sparse computation. While MoE opens a new frontier of exceedingly large models, its implementation over thousands of GPUs has been limited by the mismatch between the dynamic nature of MoE and the static parallelism/pipelining of the system. We present Tutel, a highly scalable stack design and implementation for MoE with dynamically adaptive parallelism and pipelining. Tutel delivers adaptive parallelism switching and adaptive pipelining at runtime, which achieve up to 1.74x and 2.00x single-MoE-layer speedup, respectively. We also propose a novel two-dimensional hierarchical algorithm for MoE communication that outperforms the previous state of the art by up to 20.7x over 2,048 GPUs. Aggregating all techniques, Tutel delivers 4.96x and 5.75x speedup of a single MoE layer on 16 GPUs and 2,048 GPUs, respectively, over Fairseq, the Facebook AI Research Sequence-to-Sequence Toolkit (Tutel is now partially adopted by Fairseq). Tutel's source code is publicly available at https://github.com/microsoft/tutel . Our evaluation shows that Tutel efficiently and effectively runs a real-world MoE-based model named SwinV2-MoE, built upon Swin Transformer V2, a state-of-the-art computer vision architecture. On efficiency, Tutel accelerates SwinV2-MoE, achieving up to 1.55x and 2.11x speedup in training and inference over Fairseq, respectively. On effectiveness, the SwinV2-MoE model achieves superior accuracy to its dense counterpart in both pre-training and down-stream computer vision tasks such as COCO object detection, indicating the readiness of Tutel for end-to-end real-world model training and inference. SwinV2-MoE is open sourced at https://github.com/microsoft/Swin-Transformer .
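To make the sparse-computation idea behind MoE concrete, below is a minimal PyTorch sketch of a top-2 gated MoE layer: each token is routed to only k of E expert FFNs, so compute scales with k rather than with the total expert count. This is an illustrative example only, not Tutel's implementation; the class and parameter names (SimpleMoELayer, model_dim, hidden_dim, num_experts, k) are hypothetical.

```python
# Illustrative sparse MoE layer with top-k gating (not Tutel's API).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleMoELayer(nn.Module):
    def __init__(self, model_dim: int, hidden_dim: int, num_experts: int, k: int = 2):
        super().__init__()
        self.k = k
        # Gate: produces a routing score per expert for each token.
        self.gate = nn.Linear(model_dim, num_experts, bias=False)
        # Experts: independent feed-forward networks.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(model_dim, hidden_dim), nn.ReLU(), nn.Linear(hidden_dim, model_dim))
            for _ in range(num_experts)
        )

    def forward(self, x):                                  # x: [num_tokens, model_dim]
        scores = F.softmax(self.gate(x), dim=-1)           # routing probabilities
        topk_scores, topk_idx = scores.topk(self.k, dim=-1)
        y = torch.zeros_like(x)
        # Dispatch each token to its k selected experts and combine weighted outputs.
        for slot in range(self.k):
            idx = topk_idx[:, slot]
            for e, expert in enumerate(self.experts):
                mask = idx == e
                if mask.any():
                    y[mask] += topk_scores[mask, slot, None] * expert(x[mask])
        return y

# Usage: route 8 tokens of width 16 through 4 experts with top-2 gating.
layer = SimpleMoELayer(model_dim=16, hidden_dim=32, num_experts=4, k=2)
out = layer(torch.randn(8, 16))
```

In a distributed setting the dispatch step becomes an all-to-all exchange across GPUs, which is the communication that Tutel's adaptive parallelism, pipelining, and two-dimensional hierarchical algorithm are designed to optimize.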