Sparsely gated Mixture-of-Experts (MoE) has demonstrated its effectiveness in scaling deep neural networks to extreme sizes. Although numerous efforts have been made to improve MoE performance from the model design or system optimization perspective, existing MoE dispatch patterns still cannot fully exploit the underlying heterogeneous network environment. In this paper, we propose TA-MoE, a topology-aware routing strategy for large-scale MoE training, designed from a model-system co-design perspective, which dynamically adjusts the MoE dispatch pattern according to the network topology. Based on communication modeling, we abstract the dispatch problem into an optimization objective and obtain the approximate dispatch pattern under different topologies. On top of that, we design a topology-aware auxiliary loss, which adaptively routes the data to fit the underlying topology without sacrificing model accuracy. Experiments show that TA-MoE substantially outperforms its counterparts on various hardware and model configurations, with improvements of roughly 1.01x-1.61x, 1.01x-4.77x, and 1.25x-1.54x over the popular DeepSpeed-MoE, FastMoE, and FasterMoE, respectively.
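To make the idea of a topology-aware auxiliary loss concrete, the following is a minimal sketch and not the formulation used by TA-MoE, which the abstract does not specify. It assumes top-1 gating and a load-balancing term in the style commonly used for MoE models (dispatch fraction times mean gate probability), reweighted by a hypothetical per-expert `target_share` that a communication model might assign (larger for experts reachable over fast links). The function name and all parameters are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def topology_weighted_aux_loss(logits, target_share):
    """Hypothetical topology-weighted load-balancing loss (illustration only).

    logits:       (num_tokens, num_experts) gate logits.
    target_share: (num_experts,) desired fraction of tokens per expert,
                  assumed to come from a communication model; sums to 1.
    """
    num_tokens, num_experts = logits.shape
    probs = softmax(logits, axis=-1)
    top1 = probs.argmax(axis=-1)  # top-1 routing decision per token
    dispatch_frac = np.bincount(top1, minlength=num_experts) / num_tokens
    mean_prob = probs.mean(axis=0)
    # Assumption: experts with a small target share (e.g. behind slow links)
    # get a proportionally larger penalty. A uniform target gives weight 1
    # for every expert, recovering the plain load-balancing term.
    weights = 1.0 / (num_experts * np.asarray(target_share, dtype=float))
    return float(num_experts * np.sum(weights * dispatch_frac * mean_prob))

# Usage: 4 experts, where experts 0-1 are "near" and experts 2-3 are "far".
target = np.array([0.4, 0.4, 0.1, 0.1])
near_logits = np.zeros((8, 4)); near_logits[:, 0] = 5.0  # all tokens -> expert 0
far_logits = np.zeros((8, 4)); far_logits[:, 2] = 5.0    # all tokens -> expert 2
loss_near = topology_weighted_aux_loss(near_logits, target)
loss_far = topology_weighted_aux_loss(far_logits, target)
```

In this sketch, sending every token to a far expert yields a larger penalty than sending every token to a near one, so minimizing the term nudges the gate toward the target dispatch pattern. In training, the term would be added to the task loss with a small coefficient.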