Pretrained general-purpose language models can achieve state-of-the-art accuracies in various natural language processing domains by adapting to downstream tasks via zero-shot, few-shot, and fine-tuning techniques. Because of their success, the size of these models has increased rapidly, requiring high-performance hardware, software, and algorithmic techniques to enable training such large models. As a result of a joint effort between Microsoft and NVIDIA, we present details on the training of the largest monolithic transformer-based language model, Megatron-Turing NLG 530B (MT-NLG), with 530 billion parameters. In this paper, we first focus on the infrastructure as well as the 3D parallelism methodology used to train this model using DeepSpeed and Megatron. Next, we detail the training process, the design of our training corpus, and our data curation techniques, which we believe are key ingredients to the success of the model. Finally, we discuss various evaluation results, as well as other interesting observations and new properties exhibited by MT-NLG. We demonstrate that MT-NLG achieves superior zero-, one-, and few-shot learning accuracies on several NLP benchmarks and establishes new state-of-the-art results. We believe that our contributions will help further the development of large-scale training infrastructures, large-scale language models, and natural language generation.
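To make the 3D parallelism mentioned above concrete, the sketch below shows how the three degrees of parallelism (tensor, pipeline, and data parallel) multiply to cover a GPU cluster. This is a minimal illustration only; the degrees used here are hypothetical assumptions, not the configuration reported in this paper.

```python
# Minimal sketch of how 3D parallelism degrees combine.
# All numbers are illustrative assumptions, not values from this paper.

def gpus_required(tensor_parallel: int, pipeline_parallel: int, data_parallel: int) -> int:
    """Total GPUs needed: the three parallelism degrees multiply,
    since each data-parallel replica holds one full pipeline, and each
    pipeline stage is sharded across the tensor-parallel group."""
    return tensor_parallel * pipeline_parallel * data_parallel

# Hypothetical example: 8-way tensor parallelism within a node,
# 16-way pipeline parallelism across nodes, 4 data-parallel replicas.
print(gpus_required(tensor_parallel=8, pipeline_parallel=16, data_parallel=4))  # 512
```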