Mixture of Experts (MoE) models are an emerging class of sparsely activated deep learning models whose compute cost grows sublinearly with their parameter count. In contrast to dense models, the sparse architecture of MoE offers the opportunity to grow model size dramatically, with significant accuracy gains, while consuming a much lower compute budget. However, supporting large-scale MoE training also brings its own system and modeling challenges. To overcome these challenges and embrace the opportunities of MoE, we first develop a system capable of scaling MoE models efficiently to trillions of parameters. It combines multi-dimensional parallelism and heterogeneous memory technologies harmoniously with MoE to enable 8x larger models on the same hardware compared with existing work. Beyond boosting system efficiency, we also present new training methods to improve MoE sample efficiency and leverage an expert pruning strategy to improve inference-time efficiency. By combining the efficient system and training methods, we are able to significantly scale up large multitask multilingual models for language generation, yielding substantial improvements in model accuracy. A model trained with 10 billion parameters on 50 languages can achieve state-of-the-art performance in Machine Translation (MT) and multilingual natural language generation tasks. The system support for efficient MoE training has been implemented and open-sourced in the DeepSpeed library.
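To make the notion of sparse activation concrete, the following is a minimal, illustrative sketch of a top-1 gated MoE layer in plain PyTorch. It is not the paper's implementation or the DeepSpeed API; the class name, dimensions, and routing details are assumptions chosen only to show why per-token compute depends on the number of *active* experts rather than the total parameter count.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Top1MoELayer(nn.Module):
    """Illustrative top-1 gated Mixture-of-Experts layer (hypothetical sketch,
    not the paper's system). Each token is routed to a single expert, so the
    per-token compute stays roughly constant as more experts are added."""

    def __init__(self, d_model: int, d_ff: int, num_experts: int):
        super().__init__()
        self.gate = nn.Linear(d_model, num_experts)  # router producing expert scores
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model)
        logits = self.gate(x)                      # (tokens, num_experts)
        probs = F.softmax(logits, dim=-1)
        top_prob, top_idx = probs.max(dim=-1)      # top-1 routing decision per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = top_idx == e
            if mask.any():
                # Only the tokens routed to this expert touch its weights.
                out[mask] = top_prob[mask].unsqueeze(-1) * expert(x[mask])
        return out


# Minimal usage: 64 tokens, 8 experts; each token activates only 1 of 8 expert FFNs,
# so total parameters grow 8x while per-token FLOPs stay close to a single dense FFN.
layer = Top1MoELayer(d_model=512, d_ff=2048, num_experts=8)
y = layer(torch.randn(64, 512))
```

In a real large-scale setup the experts would additionally be sharded across devices (expert parallelism) and combined with data, tensor, and pipeline parallelism, which is the multi-dimensional parallelism the abstract refers to.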