Foundation models are becoming the dominant deep learning technology. Pretraining a foundation model is always time-consuming due to the large scale of both the model parameters and the training dataset. Besides being compute-intensive, the training process is extremely memory-intensive and communication-intensive. These features make it necessary to apply 3D parallelism, which integrates data parallelism, pipeline model parallelism, and tensor model parallelism, to achieve high training efficiency. To this end, custom software frameworks such as Megatron-LM and DeepSpeed have been developed. However, current 3D parallelism frameworks still face two issues: i) they are not transparent to model developers, who must manually modify their models to parallelize training; ii) their utilization of computation, GPU memory, and network bandwidth is insufficient. We propose Merak, an automated 3D parallelism deep learning training framework with high resource utilization. Merak deploys automatically with an automatic model partitioner, which applies a graph sharding algorithm to a proxy representation of the model. Merak also provides a non-intrusive API for scaling out foundation model training with minimal code modification. In addition, we design a high-performance 3D parallel runtime engine in Merak. It uses several techniques to exploit the available training resources: a shifted critical path pipeline schedule that increases computation utilization, stage-aware recomputation that makes use of idle worker memory, and sub-pipelined tensor model parallelism that overlaps communication and computation. Experiments on 64 GPUs show that, compared with state-of-the-art 3D parallelism frameworks, Merak speeds up training of models with 1.5, 2.5, 8.3, and 20 billion parameters by up to 1.42X, 1.39X, 1.43X, and 1.61X, respectively.
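To make the claim of a "non-intrusive API with minimal code modification" concrete, the sketch below shows what a Trainer-style 3D-parallel training entry point could look like. The identifiers merak.init, MerakArguments, and MerakTrainer, as well as the synthetic dataset, are assumptions made for illustration and may not match Merak's released interface exactly.

```python
# Hedged sketch: a non-intrusive 3D-parallel training entry point.
# `merak.init`, `MerakArguments`, and `MerakTrainer` are assumed names for
# illustration; consult the Merak repository for the exact released API.
import torch
import merak
from transformers import GPT2Config, GPT2LMHeadModel

# Choose a 3D layout: data x tensor x pipeline degrees, whose product
# equals the total number of GPUs (e.g. 4 * 4 * 4 = 64 GPUs).
dp, tp, pp = 4, 4, 4
merak.init(dp, tp, pp)  # assumed initializer for the 3D process groups

# The model is written exactly as in single-GPU training; no manual partitioning.
model = GPT2LMHeadModel(GPT2Config(n_layer=48, n_embd=1600, n_head=25))  # ~1.5B params

# Synthetic stand-in for a tokenized causal-LM dataset.
train_dataset = [
    {"input_ids": torch.randint(0, 50257, (512,)),
     "labels": torch.randint(0, 50257, (512,))}
    for _ in range(64)
]

training_args = merak.MerakArguments(   # assumed Trainer-style arguments object
    output_dir="./ckpts",
    per_device_train_batch_size=8,
    gradient_accumulation_steps=16,
)

trainer = merak.MerakTrainer(           # assumed drop-in replacement for a Trainer
    model=model,
    args=training_args,
    train_dataset=train_dataset,
)
trainer.train()  # graph partitioning and 3D scheduling happen inside the framework
```

The point of the sketch is the contrast with manual 3D parallelism: the model definition and training loop stay in their single-device form, and the parallel layout is expressed only through the initialization call and arguments object.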