In recent years, Mixture-of-Experts (MoE) has emerged as a promising technique for deep learning that can scale model capacity to trillion-plus parameters while reducing computing cost via sparse computation. While MoE opens a new frontier of exceedingly large models, its implementation over thousands of GPUs has been limited by the mismatch between the dynamic nature of MoE and the static parallelism/pipelining of the system. We present Tutel, a highly scalable stack design and implementation for MoE with dynamically adaptive parallelism and pipelining. Tutel delivers adaptive parallelism switching and adaptive pipelining at runtime, which achieve up to 1.74x and 2.00x single-MoE-layer speedup, respectively. We also propose a novel two-dimensional hierarchical algorithm for MoE communication that outperforms the previous state of the art by up to 20.7x over 2,048 GPUs. Aggregating all techniques, Tutel delivers 4.96x and 5.75x speedup of a single MoE layer on 16 GPUs and 2,048 GPUs, respectively, over Fairseq, the Facebook AI Research Sequence-to-Sequence Toolkit (Tutel is now partially adopted by Fairseq). Tutel's source code is publicly available at https://github.com/microsoft/tutel . Our evaluation shows that Tutel efficiently and effectively runs a real-world MoE-based model named SwinV2-MoE, built upon Swin Transformer V2, a state-of-the-art computer vision architecture. On efficiency, Tutel accelerates SwinV2-MoE, achieving up to 1.55x and 2.11x speedup in training and inference over Fairseq, respectively. On effectiveness, the SwinV2-MoE model achieves superior accuracy to its dense counterpart in both pre-training and down-stream computer vision tasks such as COCO object detection, indicating the readiness of Tutel for end-to-end real-world model training and inference. SwinV2-MoE is open sourced at https://github.com/microsoft/Swin-Transformer .
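To make the sparse-computation idea behind MoE concrete, below is a minimal PyTorch sketch of a top-2 gated MoE layer: each token is routed to only k of E expert FFNs, so compute scales with k rather than with the total expert count. This is an illustrative example only, not Tutel's implementation; the class and parameter names (SimpleMoELayer, model_dim, hidden_dim, num_experts, k) are hypothetical.

```python
# Illustrative sparse MoE layer with top-k gating (not Tutel's API).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleMoELayer(nn.Module):
    def __init__(self, model_dim: int, hidden_dim: int, num_experts: int, k: int = 2):
        super().__init__()
        self.k = k
        # Gate: produces a routing score per expert for each token.
        self.gate = nn.Linear(model_dim, num_experts, bias=False)
        # Experts: independent feed-forward networks.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(model_dim, hidden_dim), nn.ReLU(), nn.Linear(hidden_dim, model_dim))
            for _ in range(num_experts)
        )

    def forward(self, x):                                  # x: [num_tokens, model_dim]
        scores = F.softmax(self.gate(x), dim=-1)           # routing probabilities
        topk_scores, topk_idx = scores.topk(self.k, dim=-1)
        y = torch.zeros_like(x)
        # Dispatch each token to its k selected experts and combine weighted outputs.
        for slot in range(self.k):
            idx = topk_idx[:, slot]
            for e, expert in enumerate(self.experts):
                mask = idx == e
                if mask.any():
                    y[mask] += topk_scores[mask, slot, None] * expert(x[mask])
        return y

# Usage: route 8 tokens of width 16 through 4 experts with top-2 gating.
layer = SimpleMoELayer(model_dim=16, hidden_dim=32, num_experts=4, k=2)
out = layer(torch.randn(8, 16))
```

In a distributed setting the dispatch step becomes an all-to-all exchange across GPUs, which is the communication that Tutel's adaptive parallelism, pipelining, and two-dimensional hierarchical algorithm are designed to optimize.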