As giant dense models advance quality but require large GPU budgets for training, the sparsely gated Mixture-of-Experts (MoE), a conditional computation architecture, has been proposed to scale model capacity while keeping the computation cost nearly constant. Specifically, input tokens are routed by a gating network and activate only a subset of the expert networks. Existing MoE training systems support only a subset of mainstream gating strategies (e.g., Top-K) and rely on expensive high-bandwidth GPU clusters. In this paper, we present HetuMoE, a high-performance large-scale sparse MoE training system built on Hetu. HetuMoE provides multiple gating strategies and efficient GPU kernel implementations. To further improve training efficiency on commodity GPU clusters (e.g., with only one NIC), we introduce a hierarchical AllToAll communication scheme that combines hierarchical networking with message aggregation. Compared with existing state-of-the-art MoE systems, HetuMoE achieves at least a 15% speedup. In particular, HetuMoE outperforms DeepSpeed-MoE by up to 8.1x under the Switch gate with a batch size of 32. Our code is available at: https://github.com/PKU-DAIR/Hetu.
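To make the gating mechanism concrete, the following is a minimal PyTorch sketch of Top-K token routing, not HetuMoE's actual GPU kernels; the function and variable names (`topk_gate`, `gate_weight`) are illustrative assumptions.

```python
# Minimal sketch of Top-K gating: each token's gate scores select K experts,
# and only those experts are activated for that token.
import torch
import torch.nn.functional as F

def topk_gate(tokens, gate_weight, k=2):
    # tokens: [num_tokens, d_model]; gate_weight: [d_model, num_experts] (hypothetical shapes)
    logits = tokens @ gate_weight                 # [num_tokens, num_experts]
    probs = F.softmax(logits, dim=-1)             # gate probabilities per expert
    topk_probs, topk_idx = probs.topk(k, dim=-1)  # route each token to its top-k experts
    return topk_probs, topk_idx

# Usage: the returned indices drive token dispatch (e.g., via AllToAll across GPUs),
# and the probabilities weight the combined expert outputs.
tokens = torch.randn(8, 16)
gate_w = torch.randn(16, 4)
weights, expert_ids = topk_gate(tokens, gate_w, k=2)
```

Setting k=1 recovers the Switch-style gate mentioned above; larger k trades extra expert computation and communication for routing quality.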