Together with the improvements in state-of-the-art accuracy on various tasks, deep learning models are becoming significantly larger. However, it is extremely difficult to train these large models, because limited GPU memory makes it impossible to fit them into a single GPU or even a single GPU server. In addition, it is essential to reduce the training time for such large models. Previous methods such as Megatron-LM adopted 1-dimensional (1-D) tensor parallelism across GPUs to speed up training. However, these methods suffer from high communication overhead and low scaling efficiency on large-scale clusters. To solve these problems, we propose Tesseract, a highly scalable tensor-parallel approach with a novel design. It increases efficiency by reducing communication overhead and lowers the memory required for each GPU. By introducing a new dimension into tensor parallelism, Tesseract greatly increases the memory capacity available to tensor-parallel training; concretely, this new dimension further increases the degree of tensor parallelism. Compared to previous 1-D and 2-D methods, Tesseract reduces the communication cost on each layer, resulting in speedups of 1.38x and 1.53x respectively under strong scaling. In weak scaling experiments, Tesseract achieves up to 4.0x/1.7x inference speedup and 3.4x/1.7x throughput improvement compared to the 1-D/2-D methods, respectively. By introducing Tesseract, we offer a more efficient and scalable way to train large deep learning models with limited GPU resources.
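As a rough illustration of the idea of adding an extra dimension to tensor parallelism, the following minimal single-process NumPy sketch (our own illustration with hypothetical shapes and names, not code or an API from the paper) partitions a toy matrix multiply over a q x q x d logical device grid and checks that the block-parallel result matches the dense product.

```python
import numpy as np

# Hypothetical sketch: activation X is split row-wise into q*d blocks and the
# weight W into a q x q grid shared across the depth dimension, so each of the
# q*q*d logical "devices" only holds one block of X and one block of W while
# jointly computing Y = X @ W.
q, d = 2, 2                    # q x q x d logical device grid
m, k, n = 8, 8, 8              # toy layer sizes, divisible by q and q*d

X = np.random.randn(m, k)      # input activation
W = np.random.randn(k, n)      # layer weight

# Row blocks of X: one block per (depth, grid-row) pair -> q*d blocks.
X_blocks = [np.vsplit(part, q) for part in np.vsplit(X, d)]   # [d][q], each (m/(q*d), k)
# q x q grid of W blocks, replicated along the depth dimension.
W_blocks = [np.hsplit(row, q) for row in np.vsplit(W, q)]     # [q][q], each (k/q, n/q)

# Each logical device multiplies its slice of X against one W block; partial
# products along the contraction dimension are summed, mimicking the reduce step.
Y_rows = []
for l in range(d):
    for r in range(q):
        x_row = X_blocks[l][r]           # (m/(q*d), k)
        x_parts = np.hsplit(x_row, q)    # split along k to match W's block rows
        out_cols = []
        for j in range(q):
            partial = sum(x_parts[i] @ W_blocks[i][j] for i in range(q))
            out_cols.append(partial)
        Y_rows.append(np.hstack(out_cols))
Y = np.vstack(Y_rows)

assert np.allclose(Y, X @ W)   # block-parallel result matches the dense matmul
```

In a real multi-GPU setting, each block would live on a different device and the summation and concatenation steps would become collective communication; the sketch only shows how the extra depth dimension d multiplies the number of activation partitions beyond a plain q x q grid.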