Training large deep learning models at scale is highly challenging. This paper proposes Chimera, a novel pipeline parallelism scheme that combines bidirectional pipelines to efficiently train large-scale models. Chimera is a synchronous approach and therefore incurs no loss of accuracy, making it more convergence-friendly than asynchronous approaches. Compared with the latest synchronous pipeline approach, Chimera reduces the number of bubbles by up to 50%; benefiting from the sophisticated scheduling of bidirectional pipelines, Chimera has a more balanced activation memory consumption. Evaluations are conducted on Transformer-based language models. For a GPT-2 model with 1.3 billion parameters running on 2,048 GPU nodes of the Piz Daint supercomputer, Chimera improves the training throughput by 1.16x-2.34x over the state-of-the-art synchronous and asynchronous pipeline approaches.
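The following is a minimal illustrative sketch, not the paper's code, of the bidirectional idea described above: two pipelines run in opposite directions over the same workers, so each worker hosts one stage of the "down" pipeline and the mirrored stage of the "up" pipeline. Pairing an early stage with a late stage on every worker is what yields the more balanced activation memory consumption mentioned in the abstract. The function name and mapping layout below are assumptions for illustration only.

```python
# Illustrative sketch of Chimera-style bidirectional stage placement
# (hypothetical helper, not from the paper's implementation).

def chimera_stage_mapping(num_workers: int) -> dict[int, tuple[int, int]]:
    """Return {worker_rank: (down_pipeline_stage, up_pipeline_stage)}.

    Worker i holds stage i of the down pipeline and the mirrored
    stage (num_workers - 1 - i) of the up pipeline.
    """
    return {
        rank: (rank, num_workers - 1 - rank)
        for rank in range(num_workers)
    }


if __name__ == "__main__":
    # For 4 pipeline stages: worker 0 pairs stage 0 with stage 3,
    # worker 1 pairs stage 1 with stage 2, and so on.
    for rank, (down, up) in chimera_stage_mapping(4).items():
        print(f"worker {rank}: down-pipeline stage {down}, up-pipeline stage {up}")
```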