Large language models have led to state-of-the-art accuracies across a range of tasks. However, training these models efficiently is challenging for two reasons: (a) GPU memory capacity is limited, making it impossible to fit large models on even a multi-GPU server; and (b) the number of compute operations required to train these models can result in unrealistically long training times. Consequently, new methods of model parallelism such as tensor and pipeline parallelism have been proposed. Unfortunately, naive usage of these methods leads to fundamental scaling issues at thousands of GPUs, e.g., due to expensive cross-node communication or devices spending significant time waiting on other devices to make progress. In this paper, we show how different types of parallelism (tensor, pipeline, and data parallelism) can be composed to scale to thousands of GPUs and to models with trillions of parameters. We survey techniques for pipeline parallelism and propose a novel interleaved pipeline-parallelism schedule that can improve throughput by more than 10% with a memory footprint comparable to existing approaches. We quantitatively study the trade-offs between tensor, pipeline, and data parallelism, and provide intuition on how to configure distributed training of a large model. Our approach allows us to perform training iterations on a model with 1 trillion parameters at 502 petaFLOP/s on 3072 GPUs, with an achieved per-GPU throughput of 52% of theoretical peak. Our code is open sourced at https://github.com/nvidia/megatron-lm.
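As a rough sanity check on the headline numbers (not stated in the abstract itself, so the hardware peak is an assumption here), dividing the aggregate throughput by the GPU count and comparing against the 312 teraFLOP/s FP16 tensor-core peak of an NVIDIA A100 recovers the quoted per-GPU efficiency; likewise, the total GPU count is the product of the tensor-, pipeline-, and data-parallel degrees, where the $8 \times 64 \times 6$ split below is only an illustrative factorization, not necessarily the configuration used for the 1-trillion-parameter run:
\[
\frac{502\ \text{petaFLOP/s}}{3072\ \text{GPUs}} \approx 163\ \text{teraFLOP/s per GPU},
\qquad
\frac{163}{312} \approx 0.52,
\]
\[
n_{\text{GPUs}} = t \cdot p \cdot d, \qquad \text{e.g.}\ 3072 = 8 \times 64 \times 6 .
\]
The rough intuition developed in the paper is that the tensor-parallel degree $t$ is best kept within a single multi-GPU server, where interconnect bandwidth is highest, with pipeline and data parallelism spanning servers.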