Foundation models are becoming the dominant deep learning technology. Pretraining a foundation model is always time-consuming due to the large scale of both the model parameters and the training dataset. Besides being compute-intensive, the training process is extremely memory-intensive and communication-intensive. These features make it necessary to apply 3D parallelism, which integrates data parallelism, pipeline model parallelism, and tensor model parallelism, to achieve high training efficiency. To this end, custom software frameworks such as Megatron-LM and DeepSpeed have been developed. However, current 3D parallelism frameworks still face two issues: i) they are not transparent to model developers, who must manually modify their models to parallelize training; ii) their utilization of computation, GPU memory, and network bandwidth is insufficient. We propose Merak, an automated 3D parallelism deep learning training framework with high resource utilization. Merak deploys automatically with an automatic model partitioner, which applies a graph sharding algorithm to a proxy representation of the model. Merak also provides a non-intrusive API for scaling out foundation model training with minimal code modification. In addition, we design a high-performance 3D parallel runtime engine in Merak. It uses several techniques to exploit the available training resources: a shifted critical path pipeline schedule that increases computation utilization, stage-aware recomputation that makes use of idle worker memory, and sub-pipelined tensor model parallelism that overlaps communication and computation. Experiments on 64 GPUs show that, compared with state-of-the-art 3D parallelism frameworks, Merak speeds up training of models with 1.5, 2.5, 8.3, and 20 billion parameters by up to 1.42X, 1.39X, 1.43X, and 1.61X, respectively.
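To make the claim of a "non-intrusive API with minimal code modification" concrete, the sketch below shows what a Trainer-style 3D-parallel training entry point could look like. The identifiers merak.init, MerakArguments, and MerakTrainer, as well as the synthetic dataset, are assumptions made for illustration and may not match Merak's released interface exactly.

```python
# Hedged sketch: a non-intrusive 3D-parallel training entry point.
# `merak.init`, `MerakArguments`, and `MerakTrainer` are assumed names for
# illustration; consult the Merak repository for the exact released API.
import torch
import merak
from transformers import GPT2Config, GPT2LMHeadModel

# Choose a 3D layout: data x tensor x pipeline degrees, whose product
# equals the total number of GPUs (e.g. 4 * 4 * 4 = 64 GPUs).
dp, tp, pp = 4, 4, 4
merak.init(dp, tp, pp)  # assumed initializer for the 3D process groups

# The model is written exactly as in single-GPU training; no manual partitioning.
model = GPT2LMHeadModel(GPT2Config(n_layer=48, n_embd=1600, n_head=25))  # ~1.5B params

# Synthetic stand-in for a tokenized causal-LM dataset.
train_dataset = [
    {"input_ids": torch.randint(0, 50257, (512,)),
     "labels": torch.randint(0, 50257, (512,))}
    for _ in range(64)
]

training_args = merak.MerakArguments(   # assumed Trainer-style arguments object
    output_dir="./ckpts",
    per_device_train_batch_size=8,
    gradient_accumulation_steps=16,
)

trainer = merak.MerakTrainer(           # assumed drop-in replacement for a Trainer
    model=model,
    args=training_args,
    train_dataset=train_dataset,
)
trainer.train()  # graph partitioning and 3D scheduling happen inside the framework
```

The point of the sketch is the contrast with manual 3D parallelism: the model definition and training loop stay in their single-device form, and the parallel layout is expressed only through the initialization call and arguments object.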