With the increasing diversity of ML infrastructure, distributed training over heterogeneous computing systems is desirable for producing large models. Mixture-of-Experts (MoE) models have been proposed to lower training cost relative to the overall size of models and data through gating and parallelism in a divide-and-conquer fashion. While DeepSpeed has made efforts to carry out large-scale MoE training over heterogeneous infrastructures, the efficiency of training and inference can still be improved in several system aspects, including load balancing, communication/computation efficiency, and memory footprint. In this work, we present SE-MoE, which proposes Elastic MoE training with 2D prefetch and Fusion communication over Hierarchical storage, so as to enjoy efficient parallelism of various types. For scalable inference on a single node, especially when the model size is larger than GPU memory, SE-MoE forms the CPU and GPU memory jointly into a ring of sections to hold the model, and executes the computation tasks across the memory sections in a round-robin manner for efficient inference. We carried out extensive experiments to evaluate SE-MoE, in which it successfully trains a Unified Feature Optimization (UFO) model with a Sparsely-Gated Mixture-of-Experts model of 12B parameters in 8 days on 48 A100 GPU cards. The comparison against the state-of-the-art shows that SE-MoE outperformed DeepSpeed, with 33% higher throughput (tokens per second) in training and 13% higher throughput in inference in general. In particular, under unbalanced MoE tasks, e.g., UFO, SE-MoE achieved 64% higher throughput with an 18% lower memory footprint. The code of the framework will be released at https://github.com/PaddlePaddle/Paddle.
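To make the ring-of-sections idea concrete, the following is a minimal sketch, not the SE-MoE implementation: it illustrates how a model larger than GPU memory can be split into sections spread across CPU and GPU memory and executed in a round-robin order, so that only one section needs to reside on the GPU at a time. All names here (Section, load_to_gpu, offload_to_cpu, run, ring_inference) are hypothetical placeholders introduced for illustration.

    from dataclasses import dataclass
    from typing import List


    @dataclass
    class Section:
        """One contiguous slice of model weights plus its compute step."""
        name: str
        weights: object          # e.g. a tensor shard kept in pinned CPU memory
        on_gpu: bool = False

        def load_to_gpu(self):
            # In a real system this would be an asynchronous host-to-device copy.
            self.on_gpu = True

        def offload_to_cpu(self):
            # Free GPU memory so the next section in the ring can be loaded.
            self.on_gpu = False

        def run(self, activations):
            # Placeholder for the layer computation on `activations`.
            assert self.on_gpu, "section must be resident on GPU before compute"
            return activations  # identity stands in for the real forward pass


    def ring_inference(sections: List[Section], activations, num_steps: int):
        """Cycle through the ring of sections in round-robin order."""
        n = len(sections)
        for step in range(num_steps):
            section = sections[step % n]
            section.load_to_gpu()        # bring the needed slice onto the GPU
            activations = section.run(activations)
            section.offload_to_cpu()     # make room for the next slice in the ring
        return activations


    if __name__ == "__main__":
        ring = [Section(f"slice_{i}", weights=None) for i in range(4)]
        out = ring_inference(ring, activations=[0.0], num_steps=4)

In an actual system the host-to-device copies would overlap with computation (e.g. via separate streams and prefetching the next section), which is what makes the round-robin schedule efficient; the sketch omits that overlap for brevity.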