To train modern large DNN models, pipeline parallelism has recently emerged, which distributes the model across GPUs and enables different devices to process different microbatches in a pipelined fashion. Earlier pipeline designs allow multiple versions of model parameters to co-exist (similar to asynchronous training), and cannot ensure the same model convergence and accuracy as training without pipelining. Synchronous pipelining has recently been proposed, which preserves model performance by enforcing a synchronization barrier between training iterations. Nonetheless, the synchronization barrier requires waiting for gradient aggregation from all microbatches and thus delays training progress. Optimized pipeline planning is needed to minimize such waiting and hence the training time, but it has not been well studied in the literature. This paper designs efficient, near-optimal algorithms for expediting synchronous pipeline-parallel training of modern large DNNs over arbitrary inter-GPU connectivity. Our algorithmic framework comprises two components: a pipeline partition and device mapping algorithm, and a pipeline scheduler that decides the processing order of microbatches over the partitions, which together minimize the per-iteration training time. We conduct thorough theoretical analysis, extensive testbed experiments and trace-driven simulation, and demonstrate that our scheme can accelerate training by up to 157% compared with state-of-the-art designs.