The time required to train a neural network increases with its size, complexity, and depth. Training model parameters by backpropagation inherently creates feedback loops. These loops hinder efficient pipelining and scheduling of tasks within a layer and between consecutive layers. Prior approaches, such as PipeDream, have exploited delayed gradients to achieve inter-layer pipelining. However, these approaches treat the entire backpropagation step as a single task, which increases computation time and leaves processors underutilized. This paper presents novel optimization approaches in which the gradient computations with respect to the weights and the activation functions are treated independently and can therefore be computed in parallel. This is referred to as intra-layer optimization. Additionally, the gradient computation with respect to the activation function is further divided into two parts and distributed to two consecutive layers. This leads to balanced scheduling in which the computation time of each layer is the same, and is referred to as inter-layer optimization. The proposed system, referred to as LayerPipe, reduces the number of clock cycles required for training while maximizing processor utilization with minimal inter-processor communication overhead. LayerPipe achieves an average speedup of 25%, and upwards of 80% with 7 to 9 processors, with less communication overhead than PipeDream.
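To make the intra-layer independence concrete, the following minimal sketch (not taken from the paper; the fully connected layer, variable names, and thread-pool scheduling are illustrative assumptions) shows that, for a layer y = Wx, the weight gradient and the activation gradient both depend only on the upstream gradient and cached forward values, and share no intermediate results, so they can be issued to separate processors.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

# Illustrative fully connected layer: y = W @ x.
# Given the upstream gradient delta = dL/dy, the two backward tasks below
# are independent of each other -- each needs only delta plus a cached value.

def weight_gradient(delta, x):
    # dL/dW = delta * x^T (outer product); uses delta and the cached input x
    return np.outer(delta, x)

def activation_gradient(delta, W):
    # dL/dx = W^T @ delta; uses delta and the layer's weights only
    return W.T @ delta

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 3))
x = rng.standard_normal(3)
delta = rng.standard_normal(4)      # gradient arriving from the next layer

# Because the two computations share no intermediates, they can be scheduled
# in parallel (emulated here with a two-worker thread pool).
with ThreadPoolExecutor(max_workers=2) as pool:
    dW_future = pool.submit(weight_gradient, delta, x)
    dx_future = pool.submit(activation_gradient, delta, W)
    dW, dx = dW_future.result(), dx_future.result()

print(dW.shape, dx.shape)           # (4, 3) (3,)
```

In the same spirit, the inter-layer optimization described in the abstract would split the activation-gradient task (here `activation_gradient`) into two parts assigned to consecutive layers so that per-layer computation times balance; that partitioning is specific to the paper's scheduling scheme and is not reproduced here.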