Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model

Shaden Smith, Mostofa Patwary, Brandon Norick, Patrick LeGresley, Samyam Rajbhandari, Jared Casper, Zhun Liu, Shrimai Prabhumoye, George Zerveas, Vijay Korthikanti, Elton Zhang, Rewon Child, Reza Yazdani Aminabadi, Julie Bernauer, Xia Song, Mohammad Shoeybi, Yuxiong He, Michael Houston, Saurabh Tiwary, Bryan Catanzaro

Source: arXiv. Shaden Smith and Mostofa Patwary contributed equally.

Pretrained general-purpose language models can achieve state-of-the-art accuracies in various natural language processing domains by adapting to downstream tasks via zero-shot, few-shot, and fine-tuning techniques. Because of their success, the size of these models has increased rapidly, requiring high-performance hardware, software, and algorithmic techniques to make training such large models possible. As a result of a joint effort between Microsoft and NVIDIA, we present details on the training of the largest monolithic transformer-based language model, Megatron-Turing NLG 530B (MT-NLG), with 530 billion parameters. In this paper, we first focus on the infrastructure as well as the 3D parallelism methodology used to train this model using DeepSpeed and Megatron. Next, we detail the training process, the design of our training corpus, and our data curation techniques, which we believe are key ingredients in the success of the model. Finally, we discuss various evaluation results, as well as other interesting observations and new properties exhibited by MT-NLG. We demonstrate that MT-NLG achieves superior zero-, one-, and few-shot learning accuracies on several NLP benchmarks and establishes new state-of-the-art results. We believe that our contributions will help further the development of large-scale training infrastructures, large-scale language models, and natural language generation.
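
To make the term "3D parallelism" concrete, the following is a minimal sketch of how a flat GPU rank could be decomposed into data-, pipeline-, and tensor-parallel coordinates. The rank ordering (tensor index varying fastest), the function names, and the small degrees used in the usage note are illustrative assumptions and are not taken from the abstract. The weight-size helper likewise assumes 16-bit storage, which the abstract does not state.

```python
from typing import NamedTuple


class Coord(NamedTuple):
    data: int      # index of the model replica (data parallelism)
    pipeline: int  # index of the pipeline stage inside that replica
    tensor: int    # index of the tensor-parallel shard inside that stage


def rank_to_coord(rank: int, tensor: int, pipeline: int, data: int) -> Coord:
    """Map a flat rank to 3D coordinates.

    Assumed layout (hypothetical): tensor index varies fastest, then
    pipeline stage, then data-parallel replica.
    """
    world_size = tensor * pipeline * data
    if not 0 <= rank < world_size:
        raise ValueError(f"rank {rank} is outside world size {world_size}")
    t = rank % tensor
    p = (rank // tensor) % pipeline
    d = rank // (tensor * pipeline)
    return Coord(data=d, pipeline=p, tensor=t)


def fp16_weight_bytes(num_params: int) -> int:
    """Bytes needed to hold the weights alone, assuming 2 bytes per parameter."""
    return 2 * num_params
```

With illustrative degrees of 2 (tensor), 3 (pipeline), and 2 (data), `rank_to_coord(7, 2, 3, 2)` places rank 7 in replica 1, pipeline stage 0, tensor shard 1. Under the 16-bit assumption, `fp16_weight_bytes(530_000_000_000)` gives 1.06 × 10^12 bytes for the weights alone, which suggests why a model of this size has to be split across many devices.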
