Stochastic Gradient Descent (SGD) has become the de facto way to train deep neural networks in distributed clusters. A critical factor in determining training throughput and model accuracy is the choice of the parameter synchronization protocol. For example, while Bulk Synchronous Parallel (BSP) often achieves better converged accuracy, the corresponding training throughput can be negatively impacted by stragglers. In contrast, Asynchronous Parallel (ASP) can achieve higher throughput, but its convergence and accuracy can be impacted by stale gradients. To improve synchronization performance, recent work often focuses on designing new protocols that rely heavily on hard-to-tune hyper-parameters. In this paper, we design a hybrid synchronization approach that exploits the benefits of both BSP and ASP, i.e., reducing training time while simultaneously maintaining the converged accuracy. Based on extensive empirical profiling, we devise a collection of adaptive policies that determine how and when to switch between synchronization protocols. Our policies include both offline ones that target recurring jobs and online ones for handling transient stragglers. We implement the proposed policies in a prototype system, called Sync-Switch, on top of TensorFlow, and evaluate the training performance with popular deep learning models and datasets. Our experiments show that Sync-Switch achieves up to 5.13X throughput speedup and similar converged accuracy when compared to BSP. Further, we observe that Sync-Switch achieves 3.8% higher converged accuracy with just 1.23X the training time compared to training with ASP. Moreover, Sync-Switch can be used in settings where training with ASP leads to divergence errors. Sync-Switch achieves all of these benefits with very low overhead, e.g., the framework overhead can be as low as 1.7% of the total training time.
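To make the switching idea concrete, below is a minimal, framework-agnostic Python sketch of an offline switching policy. All names here (SyncSwitchPolicy, switch_fraction, protocol) are illustrative assumptions rather than Sync-Switch's actual API, and the BSP-first schedule is one plausible instantiation of the hybrid approach described above, not a definitive implementation.

```python
# Minimal sketch of a hybrid BSP -> ASP synchronization schedule.
# Assumption: train with BSP early (for convergence stability) and
# switch to ASP later (for throughput). Names are hypothetical.

from dataclasses import dataclass


@dataclass
class SyncSwitchPolicy:
    """Offline policy for recurring jobs: a fixed switch point."""
    total_epochs: int
    switch_fraction: float = 0.5  # fraction of epochs run under BSP (assumed)

    def protocol(self, epoch: int) -> str:
        # BSP before the switch point, ASP afterwards.
        return "BSP" if epoch < self.switch_fraction * self.total_epochs else "ASP"


def train(policy: SyncSwitchPolicy) -> None:
    for epoch in range(policy.total_epochs):
        proto = policy.protocol(epoch)
        # In a real system, this is where the cluster's parameter
        # synchronization mode would be reconfigured before the epoch.
        print(f"epoch {epoch}: training with {proto}")


if __name__ == "__main__":
    train(SyncSwitchPolicy(total_epochs=10))
```

An online policy for transient stragglers could reuse the same hook, overriding the protocol choice (e.g., temporarily falling back to ASP) whenever per-worker iteration times diverge beyond a threshold, then reverting once the straggler disappears.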