Mixture-of-Experts (MoE) models have become a widely adopted approach for scaling model sizes without a corresponding linear increase in compute. During MoE model training, each input token is dynamically routed to a subset of experts -- sparsely-activated feed-forward networks -- within each transformer layer. The distribution of tokens assigned to each expert varies widely and rapidly over the course of training. To handle this severe load imbalance across experts, current systems are forced to either drop tokens assigned to popular experts, degrading convergence, or frequently rebalance the resources allocated to each expert based on popularity, incurring high state-migration overheads. To break this performance-accuracy tradeoff, we introduce SYMI, an adaptive MoE training system. The key insight of SYMI is to decouple the placement of expert parameters from their large optimizer state. SYMI statically partitions the optimizer state of each expert across all training nodes. Meanwhile, SYMI dynamically adjusts the placement of expert parameters by repurposing existing weight updates, avoiding migration overheads. In doing so, SYMI right-sizes the GPU resources allocated to each expert, on a per-iteration basis, with minimal overhead. Compared to the state-of-the-art MoE training systems DeepSpeed and FlexMoE, SYMI achieves 30.5% and 25.9% faster time-to-convergence, respectively.
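The decoupling idea can be sketched in a few lines. This is a minimal, hypothetical illustration (not SYMI's actual implementation; all names, the momentum-style update, and the load-to-GPU sizing rule are assumptions for exposition): each expert's optimizer state is sharded statically across all nodes and never moves, while the expert's weights are re-placed every iteration by sending the freshly updated weights to the new placement, piggybacking on the weight-update traffic that happens anyway.

```python
import numpy as np

NUM_NODES = 4  # illustrative cluster size

class Expert:
    def __init__(self, dim, rng):
        self.weights = rng.standard_normal(dim)
        # Optimizer state (here, a single momentum buffer standing in for
        # e.g. Adam moments) is statically sharded across ALL nodes.
        # These shards never migrate, regardless of expert popularity.
        self.m_shards = np.array_split(np.zeros(dim), NUM_NODES)
        # Nodes currently holding a replica of the expert's weights.
        self.placement = [0]

def step(expert, grad, lr, token_load):
    """One training step: update the static optimizer shards, then
    re-place the weights according to this iteration's load."""
    # Each node updates only its own optimizer shard (local work only).
    new_shards = []
    for m, g in zip(expert.m_shards, np.array_split(grad, NUM_NODES)):
        new_shards.append(0.9 * m + 0.1 * g)  # momentum update on shard
    expert.m_shards = new_shards
    expert.weights = expert.weights - lr * np.concatenate(new_shards)
    # The updated weights must be broadcast to wherever the expert runs
    # next iteration anyway, so changing the placement here costs no
    # extra state migration. Sizing rule below is purely illustrative:
    # roughly one GPU per 100 routed tokens.
    num_replicas = min(NUM_NODES, max(1, token_load // 100))
    expert.placement = list(range(num_replicas))
    return expert
```

A popular expert (high `token_load`) grows to more GPUs the next iteration, and an unpopular one shrinks, while its optimizer shards stay put on all nodes.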