Neural network training requires a large amount of computation, and GPUs are thus commonly used for acceleration. Although they improve performance, GPUs remain underutilized during training. This paper proposes out-of-order (ooo) backprop, an effective scheduling technique for neural network training. By exploiting the dependencies of gradient computations, ooo backprop reorders their executions to make the most of the GPU resources. We show that GPU utilization in single-GPU, data-parallel, and pipeline-parallel training can all be improved by applying ooo backprop and prioritizing critical operations. We propose three scheduling algorithms based on ooo backprop. For single-GPU training, we schedule with multi-stream out-of-order computation to mask the kernel launch overhead. In data-parallel training, we reorder the gradient computations to maximize the overlap of computation and parameter communication; in pipeline-parallel training, we prioritize critical gradient computations to reduce pipeline stalls. We evaluate our optimizations with twelve neural networks, including a light-weight computer vision model (MobileNet) and large NLP models (BERT and GPT-3), on up to forty-eight V100 GPUs. Our scheduling algorithms effectively improve the performance of single-GPU training as well as data- and pipeline-parallel training. Compared to the respective state-of-the-art training systems, throughput is substantially improved in all three settings.
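The dependency structure that makes this reordering legal can be illustrated with a minimal NumPy sketch (the three-layer linear stack and all names below are our own illustration, not the paper's code): in backpropagation, the output-gradient chain forms the critical path, while each layer's weight gradient depends only on that layer's own output gradient and its cached forward input, so weight-gradient computations can be deferred and executed in any order afterward.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 3-layer linear network; shapes are arbitrary illustration values.
sizes = [8, 16, 16, 4]
weights = [rng.standard_normal((m, n)) for m, n in zip(sizes, sizes[1:])]

x = rng.standard_normal((32, sizes[0]))

# Forward pass, caching each layer's input (needed later for weight gradients).
inputs = []
h = x
for W in weights:
    inputs.append(h)
    h = h @ W

grad_out = np.ones_like(h)  # pretend dL/dy = 1 for simplicity

# --- Critical path: propagate output gradients first, layer by layer. ---
# Each step needs only the next layer's output gradient, never a weight gradient.
out_grads = [None] * len(weights)
g = grad_out
for l in reversed(range(len(weights))):
    out_grads[l] = g
    g = g @ weights[l].T  # dL/dx_l, consumed by layer l-1

# --- Off the critical path: weight gradients, computed later in any order. ---
# ooo backprop exploits exactly this freedom to schedule them when the GPU,
# the communication link, or a stalled pipeline stage would otherwise idle.
weight_grads = [None] * len(weights)
for l in rng.permutation(len(weights)):  # arbitrary order: still correct
    weight_grads[l] = inputs[l].T @ out_grads[l]

# Sanity check against the usual interleaved backward schedule.
g = grad_out
for l in reversed(range(len(weights))):
    assert np.allclose(weight_grads[l], inputs[l].T @ g)
    g = g @ weights[l].T
print("reordered weight-gradient computations match the standard schedule")
```

The same freedom underlies all three algorithms above: on a single GPU, the deferred weight-gradient kernels can be issued on a separate stream to hide launch latency; in data-parallel training, they can be ordered so that each gradient is ready just before its all-reduce; in pipeline-parallel training, they can fill the bubbles left by prioritized output-gradient computations.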