Recent Natural Language Processing techniques have been refreshing state-of-the-art performance at an incredible speed. Training huge language models has therefore become an imperative demand in both industry and academia. However, huge language models impose challenges on both hardware and software. Graphics processing units (GPUs) are iterated frequently to meet the exploding demand, and a variety of ASICs such as TPUs have been spawned. Still, there is a tension between the fast growth of extremely large models and the fact that Moore's law is approaching its end. To this end, many model parallelism techniques have been proposed to distribute model parameters across multiple devices, so as to alleviate the pressure on both memory and computation. Our work is the first to introduce 3-dimensional model parallelism for expediting huge language models. By reaching a perfect load balance, our approach incurs smaller memory and communication costs than existing state-of-the-art 1-D and 2-D model parallelism. Our experiments on 64 of TACC's V100 GPUs show that our 3-D parallelism outperforms 1-D and 2-D parallelism with 2.32x and 1.57x speedup, respectively.
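The core idea of distributing a matrix multiplication over a 3-D cube of devices can be illustrated with a single-process simulation. The following is a minimal sketch, not the paper's implementation: it partitions the operands of C = A @ B over a simulated p × p × p grid in the style of classic 3-D parallel matrix multiplication, where device (i, j, l) holds block A[i, l] and block B[l, j], and the sum over l stands in for an all-reduce along the depth axis of the cube. The grid size p and matrix shapes are illustrative choices.

```python
import numpy as np

p = 2                      # cube side; P = p**3 = 8 simulated devices
m = k = n = 4              # illustrative matrix dimensions, divisible by p
A = np.arange(m * k, dtype=float).reshape(m, k)
B = np.arange(k * n, dtype=float).reshape(k, n)

# Split A into a p x p grid of blocks Ab[i, l], and B into blocks Bb[l, j].
Ab = A.reshape(p, m // p, p, k // p).transpose(0, 2, 1, 3)
Bb = B.reshape(p, k // p, p, n // p).transpose(0, 2, 1, 3)

# Each simulated device (i, j, l) multiplies only its local blocks;
# accumulating over l (an all-reduce in a real system) yields C[i, j].
C = np.zeros((m, n))
for i in range(p):
    for j in range(p):
        for l in range(p):
            C[i*(m//p):(i+1)*(m//p), j*(n//p):(j+1)*(n//p)] += Ab[i, l] @ Bb[l, j]

assert np.allclose(C, A @ B)  # the distributed result matches the dense product
```

Each simulated device touches only 1/p² of A and 1/p² of B, which is the source of the memory saving over 1-D schemes that replicate a full operand on every device.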