Model parallelism has become a necessity for training modern large-scale deep language models. In this work, we identify a new dimension that is orthogonal to existing model-parallel approaches: it is possible to perform pipeline parallelism within a single training sequence for Transformer-based language models thanks to their autoregressive property. This enables a more fine-grained pipeline compared with previous work. With this key idea, we design TeraPipe, a high-performance token-level pipeline parallel algorithm for synchronous model-parallel training of Transformer-based language models. We develop a novel dynamic-programming-based algorithm to calculate the optimal pipelining execution scheme given a specific model and cluster configuration. We show that TeraPipe speeds up training of the largest GPT-3 model, with 175 billion parameters, by 5.0x on an AWS cluster with 48 p3.16xlarge instances compared with state-of-the-art model-parallel methods. The code for reproduction can be found at https://github.com/zhuohan123/terapipe.
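To make the slicing idea concrete, the sketch below is a minimal, simplified illustration (not the authors' implementation) of a dynamic program that partitions a training sequence of length L into token slices so that an assumed synchronous-pipeline latency model is minimized. The cost function `slice_time`, the latency model `sum of per-slice times + (num_stages - 1) * max slice time`, and all function names here are illustrative assumptions, not the exact formulation in the paper or the released code.

```python
# Hedged sketch of a token-slicing dynamic program for pipeline parallelism.
# Assumption: per-slice time is monotonically non-decreasing in slice length,
# and pipeline latency is approximated as
#   total latency ~= sum(slice times) + (num_stages - 1) * max(slice times).
from functools import lru_cache


def optimal_slicing(L, num_stages, slice_time):
    """Return (estimated latency, slice lengths) for one training sequence."""

    @lru_cache(maxsize=None)
    def best(start, cap):
        # Minimize the sum of slice times over tokens [start, L), using only
        # slices whose time does not exceed the candidate maximum `cap`.
        if start == L:
            return 0.0, ()
        best_total, best_plan = float("inf"), None
        for end in range(start + 1, L + 1):
            t = slice_time(end - start)
            if t > cap:  # monotone cost: longer slices only get slower
                break
            rest, plan = best(end, cap)
            if t + rest < best_total:
                best_total, best_plan = t + rest, (end - start,) + plan
        return best_total, best_plan

    best_latency, best_slices = float("inf"), None
    # Enumerate candidate "largest slice" lengths; the DP fills in the rest.
    for max_len in range(1, L + 1):
        cap = slice_time(max_len)
        total, plan = best(0, cap)
        if plan is None:
            continue
        latency = total + (num_stages - 1) * cap
        if latency < best_latency:
            best_latency, best_slices = latency, plan
    return best_latency, list(best_slices)


if __name__ == "__main__":
    # Hypothetical cost model: fixed per-slice launch overhead plus a term
    # linear in the number of tokens in the slice.
    cost = lambda n: 1.0 + 0.1 * n
    print(optimal_slicing(L=32, num_stages=4, slice_time=cost))
```

Under this toy cost model, the DP trades off the per-slice launch overhead (which favors few, long slices) against the pipeline bubble term (which favors many, short slices), which is the same tension the paper's dynamic program resolves with measured per-slice execution times.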