Foundation models are becoming the dominant deep learning technology. Pretraining a foundation model is time-consuming due to the large scale of both the model parameters and the training dataset. Besides being compute-intensive, the training process is extremely memory- and communication-intensive. These characteristics make it necessary to apply 3D parallelism, which integrates data parallelism, pipeline model parallelism, and tensor model parallelism, to achieve high training efficiency. To this end, custom software frameworks such as Megatron-LM and DeepSpeed have been developed. However, current 3D parallelism frameworks still face two issues: i) they are not transparent to model developers, who must manually modify their models to parallelize training; ii) their utilization of computation, GPU memory, and network bandwidth is insufficient. We propose Merak, an automated 3D parallelism deep learning training framework with high resource utilization. Merak deploys automatically with a model partitioner that applies a graph sharding algorithm to a proxy representation of the model, and it provides a non-intrusive API for scaling out foundation model training with minimal code modification. In addition, we design a high-performance 3D parallel runtime engine in Merak. It uses several techniques to exploit the available training resources: a shifted critical path pipeline schedule that raises computation utilization, stage-aware recomputation that makes use of idle worker memory, and sub-pipelined tensor model parallelism that overlaps communication with computation. Experiments on 64 GPUs show that Merak speeds up training over state-of-the-art 3D parallelism frameworks on models with 1.5, 2.5, 8.3, and 20 billion parameters by up to 1.42X, 1.39X, 1.43X, and 1.61X, respectively.
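To see why pipeline schedules matter for computation utilization, consider the standard idle-time ("bubble") analysis for synchronous pipelines such as GPipe or 1F1B: with p stages and m micro-batches, the bubble fraction is (p - 1) / (m + p - 1). This is a minimal sketch of that well-known formula, not Merak's shifted critical path schedule; the function name is illustrative only.

```python
def pipeline_bubble_fraction(stages: int, microbatches: int) -> float:
    """Idle-time fraction of a synchronous pipeline schedule
    (GPipe/1F1B-style analysis): (p - 1) / (m + p - 1)
    for p pipeline stages and m micro-batches."""
    return (stages - 1) / (microbatches + stages - 1)

# More micro-batches shrink the bubble but increase activation memory
# pressure -- the trade-off that motivates better pipeline schedules
# and techniques such as stage-aware recomputation.
for m in (4, 16, 64):
    frac = pipeline_bubble_fraction(4, m)
    print(f"stages=4, microbatches={m}: bubble={frac:.2%}")
```

With 4 stages, going from 4 to 64 micro-batches drops the bubble from 3/7 to 3/67 of total time, which is why frameworks interleave many micro-batches per pipeline flush.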