Mixture of Experts (MoE) models are an emerging class of sparsely activated deep learning models whose compute cost grows sublinearly with their parameter count. In contrast to dense models, the sparse architecture of MoE offers the opportunity to grow model size dramatically, with significant accuracy gains, while consuming a much lower compute budget. However, supporting large-scale MoE training also brings its own system and modeling challenges. To overcome these challenges and embrace the opportunities of MoE, we first develop a system capable of scaling MoE models efficiently to trillions of parameters. It combines multi-dimensional parallelism and heterogeneous memory technologies harmoniously with MoE to enable 8x larger models on the same hardware compared with existing work. Beyond boosting system efficiency, we also present new training methods to improve MoE sample efficiency and leverage an expert pruning strategy to improve inference-time efficiency. By combining the efficient system and training methods, we are able to significantly scale up large multitask multilingual models for language generation, yielding substantial improvements in model accuracy. A model trained with 10 billion parameters on 50 languages can achieve state-of-the-art performance in Machine Translation (MT) and multilingual natural language generation tasks. The system support for efficient MoE training has been implemented and open-sourced in the DeepSpeed library.
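To make the notion of sparse activation concrete, the following is a minimal, illustrative sketch of a top-1 gated MoE layer in plain PyTorch. It is not the paper's implementation or the DeepSpeed API; the class name, dimensions, and routing details are assumptions chosen only to show why per-token compute depends on the number of *active* experts rather than the total parameter count.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Top1MoELayer(nn.Module):
    """Illustrative top-1 gated Mixture-of-Experts layer (hypothetical sketch,
    not the paper's system). Each token is routed to a single expert, so the
    per-token compute stays roughly constant as more experts are added."""

    def __init__(self, d_model: int, d_ff: int, num_experts: int):
        super().__init__()
        self.gate = nn.Linear(d_model, num_experts)  # router producing expert scores
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model)
        logits = self.gate(x)                      # (tokens, num_experts)
        probs = F.softmax(logits, dim=-1)
        top_prob, top_idx = probs.max(dim=-1)      # top-1 routing decision per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = top_idx == e
            if mask.any():
                # Only the tokens routed to this expert touch its weights.
                out[mask] = top_prob[mask].unsqueeze(-1) * expert(x[mask])
        return out


# Minimal usage: 64 tokens, 8 experts; each token activates only 1 of 8 expert FFNs,
# so total parameters grow 8x while per-token FLOPs stay close to a single dense FFN.
layer = Top1MoELayer(d_model=512, d_ff=2048, num_experts=8)
y = layer(torch.randn(64, 512))
```

In a real large-scale setup the experts would additionally be sharded across devices (expert parallelism) and combined with data, tensor, and pipeline parallelism, which is the multi-dimensional parallelism the abstract refers to.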