Large language models have led to state-of-the-art accuracies across a range of tasks. However, training these models efficiently is challenging for two reasons: a) GPU memory capacity is limited, making it impossible to fit large models even on a multi-GPU server, and b) the number of compute operations required to train these models can result in unrealistically long training times. Consequently, new methods of model parallelism such as tensor and pipeline parallelism have been proposed. Unfortunately, naive usage of these methods leads to fundamental scaling issues at thousands of GPUs, e.g., due to expensive cross-node communication or devices spending significant time waiting on other devices to make progress. In this paper, we show how different types of parallelism methods (tensor, pipeline, and data parallelism) can be composed to scale to thousands of GPUs and models with trillions of parameters. We survey techniques for pipeline parallelism and propose a novel interleaved pipeline parallelism schedule that can improve throughput by more than 10% with a memory footprint comparable to existing approaches. We quantitatively study the trade-offs between tensor, pipeline, and data parallelism, and provide intuition as to how to configure distributed training of a large model. Our approach allows us to perform training iterations on a model with 1 trillion parameters at 502 petaFLOP/s on 3072 GPUs, with an achieved per-GPU throughput of 52% of theoretical peak. Our code is open-sourced at https://github.com/nvidia/megatron-lm.
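To make the composition of the three parallelism dimensions concrete, the sketch below shows one way a set of GPU ranks can be partitioned into tensor-, pipeline-, and data-parallel groups. This is a minimal, hypothetical illustration in plain Python, not the actual Megatron-LM code (the real logic lives in the repository's parallel-state initialization and differs in detail); the function name `build_groups` and the exact rank layout are assumptions. The layout follows the common convention of keeping tensor-parallel ranks consecutive, so that the most communication-intensive all-reduces stay within a single server.

```python
def build_groups(world_size: int, tp: int, pp: int):
    """Partition ranks 0..world_size-1 into parallel groups.

    tp: tensor-parallel degree, pp: pipeline-parallel degree.
    The data-parallel degree is whatever remains.
    (Illustrative sketch; not the Megatron-LM implementation.)
    """
    assert world_size % (tp * pp) == 0
    dp = world_size // (tp * pp)

    # Tensor groups: tp consecutive ranks, so frequent,
    # bandwidth-hungry all-reduces stay intra-node.
    tensor_groups = [list(range(i * tp, (i + 1) * tp))
                     for i in range(world_size // tp)]

    # Pipeline groups: one rank per stage, stride world_size // pp apart.
    stride = world_size // pp
    pipeline_groups = [list(range(i, world_size, stride))
                       for i in range(stride)]

    # Data groups: replicas holding the same (tensor slice, pipeline stage).
    data_groups = []
    for p in range(pp):
        start = p * stride
        for t in range(tp):
            data_groups.append(list(range(start + t, start + stride, tp)))

    return dp, tensor_groups, pipeline_groups, data_groups
```

For example, with 8 GPUs, `tp=2`, and `pp=2`, the data-parallel degree comes out to 2, and every rank belongs to exactly one group of each type. Scaling this scheme to thousands of GPUs is then a matter of choosing the (tp, pp, dp) degrees, which is precisely the trade-off the paper studies.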