With ever-increasing data volumes, there is a trend toward large-scale pre-trained models that store knowledge in an enormous number of model parameters. Training these models consists largely of dense algebra and therefore demands a huge amount of hardware resources. Recently, sparsely-gated Mixture-of-Experts (MoE) architectures have become increasingly popular and have demonstrated impressive pretraining scalability across various downstream tasks. However, such sparse conditional computation may not be as effective as expected in practical systems due to routing imbalance and routing fluctuation. More broadly, MoE models are becoming a new data analytics paradigm in the data life cycle and face unique challenges at scales, complexities, and granularities never before possible. In this paper, we propose a novel DNN training framework, FlexMoE, which systematically and transparently addresses the inefficiency caused by dynamic dataflow. We first present an empirical analysis of the problems and opportunities in training MoE models, which motivates us to overcome the routing imbalance and fluctuation problems via a dynamic expert management and device placement mechanism. We then introduce a novel scheduling module on top of the existing DNN runtime that monitors the data flow, makes scheduling plans, and dynamically adjusts the model-to-hardware mapping guided by real-time data traffic. A simple but efficient heuristic algorithm is employed to dynamically optimize device placement during training. We have conducted experiments on both NLP models (e.g., BERT and GPT) and vision models (e.g., Swin). The results show that FlexMoE achieves superior performance over existing systems on real-world workloads -- FlexMoE outperforms DeepSpeed by 1.70x on average and up to 2.10x, and outperforms FasterMoE by 1.30x on average and up to 1.45x.
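To make the idea of traffic-guided expert placement concrete, the following is a minimal illustrative sketch, not FlexMoE's actual scheduling algorithm: the function `plan_replicas` and the greedy per-replica-load heuristic are assumptions introduced here for illustration. Given per-expert token counts observed from recent routing decisions, it greedily assigns a fixed budget of expert slots across devices so that the heaviest per-replica load is reduced, which is the kind of adjustment a dynamic expert management mechanism would make when routing is imbalanced.

```python
# Illustrative sketch only (hypothetical helper, not the FlexMoE implementation):
# replicate heavily routed experts so that per-replica token load is balanced.
from collections import Counter
import heapq


def plan_replicas(token_counts, num_slots):
    """token_counts: dict expert_id -> tokens routed recently.
    num_slots: total expert slots available across all devices.
    Returns a dict expert_id -> number of replicas to place."""
    experts = list(token_counts)
    assert num_slots >= len(experts), "every expert needs at least one replica"

    replicas = Counter({e: 1 for e in experts})
    # Max-heap keyed by per-replica load (tokens / replicas), stored as negatives.
    heap = [(-token_counts[e], e) for e in experts]
    heapq.heapify(heap)

    # Greedily give each spare slot to the currently most overloaded expert.
    for _ in range(num_slots - len(experts)):
        _, e = heapq.heappop(heap)
        replicas[e] += 1
        heapq.heappush(heap, (-token_counts[e] / replicas[e], e))
    return dict(replicas)


if __name__ == "__main__":
    # Skewed routing: expert 0 receives most tokens, so it gets extra replicas.
    print(plan_replicas({0: 8000, 1: 1000, 2: 600, 3: 400}, num_slots=8))
    # -> {0: 5, 1: 1, 2: 1, 3: 1}
```

In a real system such a plan would be recomputed as the monitored traffic drifts, and replicas would be migrated between devices only when the expected gain outweighs the movement cost.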