With the increasing demand for large-scale training of machine learning models, consensus-based distributed optimization methods have recently been advocated as alternatives to the popular parameter server framework. In this paradigm, each worker maintains a local estimate of the optimal parameter vector and iteratively updates it by waiting for and averaging the estimates obtained from its neighbors, then correcting it on the basis of its local dataset. However, the synchronization phase can be time-consuming due to the need to wait for \textit{stragglers}, i.e., slower workers. An efficient way to mitigate this effect is to let each worker wait only for updates from its fastest neighbors before updating its local parameters; the remaining neighbors are called \textit{backup workers}. To minimize the global training time over the network, we propose a fully distributed algorithm that dynamically determines the number of backup workers for each worker. We show that our algorithm achieves a linear speedup for convergence, i.e., convergence performance scales linearly with the number of workers. We conduct extensive experiments on MNIST and CIFAR-10 to verify our theoretical results.
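To make the backup-worker mechanism concrete, below is a minimal sketch of one iteration at a single worker, not the paper's actual algorithm: it assumes uniform averaging weights and a fixed number of backup workers, and the function name and arguments are hypothetical. The worker averages its own estimate with those of its fastest neighbors only, then applies a local stochastic-gradient correction.

\begin{verbatim}
import numpy as np

def local_update(x, neighbor_estimates, grad, lr, num_backup):
    """One illustrative consensus step at a single worker.

    x                  : this worker's current parameter vector
    neighbor_estimates : list of (arrival_time, estimate) pairs from neighbors
    grad               : stochastic gradient computed on the local dataset
    lr                 : learning rate
    num_backup         : number of slowest neighbors to skip (backup workers)
    """
    # Wait only for the fastest neighbors; the slowest `num_backup` are ignored.
    neighbor_estimates.sort(key=lambda t: t[0])      # sort by arrival time
    k = len(neighbor_estimates) - num_backup         # fastest k neighbors
    used = [est for _, est in neighbor_estimates[:k]]

    # Consensus (averaging) over the worker's own estimate and the k received
    # ones, followed by a local gradient correction.
    consensus = np.mean(np.stack([x] + used), axis=0)
    return consensus - lr * grad
\end{verbatim}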