In distributed machine learning, a central node outsources computationally expensive calculations to external worker nodes. The properties of optimization procedures such as stochastic gradient descent (SGD) can be leveraged to mitigate the effect of slow or unresponsive workers, called stragglers, which would otherwise degrade the benefit of outsourcing the computation. This can be done by waiting for only a subset of the workers to finish their computation at each iteration of the algorithm. Previous works proposed adapting the number of workers to wait for as the algorithm progresses in order to optimize the convergence speed. In contrast, we model the communication and computation times as independent random variables. Based on this model, we construct a novel scheme that adapts both the number of workers and the computation load throughout the run-time of the algorithm. Consequently, we improve the convergence speed of distributed SGD while significantly reducing the computation load, at the expense of a slight increase in communication load.
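To make the "wait for the fastest subset of workers" idea concrete, the following is a minimal, self-contained simulation sketch, not the paper's actual scheme. All names (`worker_gradient`, `simulate_round`), the toy linear-regression task, the exponential delay distributions, and the simple rule for growing the number of workers to wait for over iterations are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup (not from the paper): distributed linear regression with squared loss.
n_workers, d, samples_per_worker = 10, 5, 20
w_true = rng.normal(size=d)
X = [rng.normal(size=(samples_per_worker, d)) for _ in range(n_workers)]
y = [Xi @ w_true + 0.1 * rng.normal(size=samples_per_worker) for Xi in X]

def worker_gradient(i, w):
    """Local gradient of the squared loss on worker i's data partition."""
    Xi, yi = X[i], y[i]
    return 2.0 / len(yi) * Xi.T @ (Xi @ w - yi)

def simulate_round(w, k, lr=0.05):
    """One SGD iteration: draw independent random computation and communication
    delays per worker and aggregate gradients only from the k fastest workers."""
    delays = rng.exponential(1.0, n_workers) + rng.exponential(0.2, n_workers)
    fastest = np.argsort(delays)[:k]
    grad = np.mean([worker_gradient(i, w) for i in fastest], axis=0)
    wall_time = np.sort(delays)[k - 1]  # central node waits for the k-th arrival
    return w - lr * grad, wall_time

w = np.zeros(d)
total_time = 0.0
for t in range(200):
    # Illustrative adaptation rule (an assumption, not the paper's policy):
    # wait for few workers early on, and for more workers in later iterations,
    # when less noisy gradient estimates are needed.
    k = min(n_workers, 2 + t // 50)
    w, dt = simulate_round(w, k)
    total_time += dt

print("final error:", np.linalg.norm(w - w_true), "simulated wall time:", total_time)
```

Under this kind of delay model, waiting for fewer workers shortens each iteration but yields noisier gradient estimates; adapting the number of workers (and, in the paper's scheme, also the per-worker computation load) trades these effects off over the run of the algorithm.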