Foundation models are becoming the dominant deep learning technologies. Pretraining a foundation model is always time-consuming due to the large scale of both the model parameters and the training dataset. Besides being compute-intensive, the training process is also extremely memory-intensive and communication-intensive. These characteristics make it necessary to apply 3D parallelism, which integrates data parallelism, pipeline model parallelism, and tensor model parallelism, to achieve high training efficiency. Custom software frameworks such as Megatron-LM and DeepSpeed have been developed for this purpose. However, current 3D parallelism frameworks still face two issues: i) they are not transparent to model developers, who must manually modify their models to parallelize training; ii) their utilization of computation, GPU memory, and network bandwidth is insufficient. We propose Merak, an automated 3D parallelism deep learning training framework with high resource utilization. Merak deploys automatically via an automatic model partitioner, which applies a graph sharding algorithm to a proxy representation of the model. Merak also provides a non-intrusive API for scaling out foundation model training with minimal code modification. In addition, we design a high-performance 3D parallel runtime engine in Merak. It uses several techniques to exploit available training resources, including a shifted critical path pipeline schedule that increases computation utilization, stage-aware recomputation that makes use of idle worker memory, and sub-pipelined tensor model parallelism that overlaps communication and computation. Experiments on 64 GPUs show that Merak speeds up training over state-of-the-art 3D parallelism frameworks for models with 1.5, 2.5, 8.3, and 20 billion parameters by up to 1.42X, 1.39X, 1.43X, and 1.61X, respectively.
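To make the "non-intrusive API with minimal code modification" claim concrete, the following is a minimal, hypothetical sketch of what such an interface could look like. The package and class names (`merak.init`, `MerakArguments`, `MerakTrainer`) and their signatures are illustrative assumptions, not quoted from the paper; the point is that the model definition remains a stock module and the user only supplies the parallelism degrees and a trainer wrapper, while partitioning happens automatically behind the API.

```python
# Hypothetical sketch of a non-intrusive 3D-parallel training API.
# All merak-prefixed names below are assumptions for illustration only.
import torch
from torch.utils.data import Dataset
from transformers import GPT2Config, GPT2LMHeadModel

import merak                                   # assumed package name
from merak import MerakArguments, MerakTrainer  # assumed classes


class RandomTokens(Dataset):
    """Tiny synthetic dataset so the sketch is self-contained."""

    def __len__(self):
        return 64

    def __getitem__(self, idx):
        ids = torch.randint(0, 50257, (128,))
        return {"input_ids": ids, "labels": ids.clone()}


# Degrees of data, tensor-model, and pipeline-model parallelism (3D parallelism).
merak.init(dp=4, tp=2, pp=2)

# Unmodified model definition: the framework is expected to partition it
# automatically via graph sharding on a proxy (traced) representation.
model = GPT2LMHeadModel(GPT2Config())

args = MerakArguments(output_dir="./ckpt", per_device_train_batch_size=8)
trainer = MerakTrainer(model=model, args=args, train_dataset=RandomTokens())
trainer.train()
```

The contrast with manual 3D parallelism setups is that neither the model class nor the training loop embeds any parallelization logic; only the initialization call and the trainer wrapper change relative to single-GPU training.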