Pipeline parallelism enables efficient training of Large Language Models (LLMs) on large-scale distributed accelerator clusters. Yet, pipeline bubbles during startup and tear-down reduce the utilization of accelerators. Although efficient pipeline schemes with micro-batching and bidirectional pipelines have been proposed to maximize utilization, a significant number of bubbles cannot be filled using synchronous forward and backward passes. To address this problem, we suggest that extra work be assigned to the bubbles to gain auxiliary benefits in LLM training. As an example in this direction, we propose PipeFisher, which assigns the work of K-FAC, a second-order optimization method based on the Fisher information matrix, to the bubbles to accelerate convergence. In Phase 1 pretraining of the BERT-Base and -Large models, PipeFisher reduces the (simulated) training time to 50-75% of that with a first-order optimizer by greatly improving accelerator utilization and benefiting from the improved convergence of K-FAC.
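For context, the extra work that K-FAC introduces comes mainly from accumulating and inverting per-layer curvature factors; it is this factor work that can, in principle, be scheduled into otherwise idle pipeline bubbles. A standard sketch of the per-layer K-FAC update (following Martens & Grosse, 2015; the notation below is illustrative and not taken from this abstract) is

\[
F_\ell \;\approx\; A_{\ell-1} \otimes B_\ell,
\qquad
A_{\ell-1} = \mathbb{E}\!\left[a_{\ell-1} a_{\ell-1}^{\top}\right],
\qquad
B_\ell = \mathbb{E}\!\left[g_\ell g_\ell^{\top}\right],
\]
\[
W_\ell \;\leftarrow\; W_\ell \;-\; \eta\, B_\ell^{-1}\,\bigl(\nabla_{W_\ell}\mathcal{L}\bigr)\, A_{\ell-1}^{-1},
\]

where \(a_{\ell-1}\) are the layer's input activations, \(g_\ell\) the backpropagated pre-activation gradients, and \(\eta\) the learning rate. Because the Kronecker factors \(A_{\ell-1}\) and \(B_\ell\) are much smaller than the full Fisher block \(F_\ell\), their construction and inversion are compact units of work that lend themselves to being overlapped with pipeline bubbles.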