Foundation models are becoming the dominant deep learning technology. Pretraining a foundation model is time-consuming due to the large scale of both the model parameters and the training dataset. Besides being compute-intensive, the training process is extremely memory- and communication-intensive. These characteristics make it necessary to apply 3D parallelism, which integrates data parallelism, pipeline model parallelism, and tensor model parallelism, to achieve high training efficiency. To this end, custom software frameworks such as Megatron-LM and DeepSpeed have been developed. However, current 3D parallelism frameworks still face two issues: i) they are not transparent to model developers, who must manually modify their models to parallelize training; ii) their utilization of computation, GPU memory, and network bandwidth is insufficient. We propose Merak, an automated 3D parallelism deep learning training framework with high resource utilization. Merak deploys automatically with a model partitioner that applies a graph sharding algorithm to a proxy representation of the model, and it provides a non-intrusive API for scaling out foundation model training with minimal code modification. In addition, we design a high-performance 3D parallel runtime engine in Merak. It uses several techniques to exploit the available training resources: a shifted critical path pipeline schedule that raises computation utilization, stage-aware recomputation that makes use of idle worker memory, and sub-pipelined tensor model parallelism that overlaps communication with computation. Experiments on 64 GPUs show that Merak speeds up training over state-of-the-art 3D parallelism frameworks on models with 1.5, 2.5, 8.3, and 20 billion parameters by up to 1.42X, 1.39X, 1.43X, and 1.61X, respectively.
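To see why pipeline schedules matter for computation utilization, consider the standard idle-time ("bubble") analysis for synchronous pipelines such as GPipe or 1F1B: with p stages and m micro-batches, the bubble fraction is (p - 1) / (m + p - 1). This is a minimal sketch of that well-known formula, not Merak's shifted critical path schedule; the function name is illustrative only.

```python
def pipeline_bubble_fraction(stages: int, microbatches: int) -> float:
    """Idle-time fraction of a synchronous pipeline schedule
    (GPipe/1F1B-style analysis): (p - 1) / (m + p - 1)
    for p pipeline stages and m micro-batches."""
    return (stages - 1) / (microbatches + stages - 1)

# More micro-batches shrink the bubble but increase activation memory
# pressure -- the trade-off that motivates better pipeline schedules
# and techniques such as stage-aware recomputation.
for m in (4, 16, 64):
    frac = pipeline_bubble_fraction(4, m)
    print(f"stages=4, microbatches={m}: bubble={frac:.2%}")
```

With 4 stages, going from 4 to 64 micro-batches drops the bubble from 3/7 to 3/67 of total time, which is why frameworks interleave many micro-batches per pipeline flush.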