We analyze (stochastic) gradient descent (SGD) with delayed updates on smooth quasi-convex and non-convex functions and derive concise, non-asymptotic convergence rates. We show that the rate of convergence in all cases consists of two terms: (i) a stochastic term which is not affected by the delay, and (ii) a higher-order deterministic term which is only linearly slowed down by the delay. Thus, in the presence of noise, the effects of the delay become negligible after a few iterations, and the algorithm converges at the same optimal rate as standard SGD. This result extends a line of research that showed similar results only in the asymptotic regime or for strongly convex quadratic functions. We further show similar results for SGD with more intricate forms of delayed gradients: compressed gradients under error compensation, and local~SGD, where multiple workers perform local steps before communicating with each other. In all of these settings, we improve upon the best known rates. These results show that SGD is robust to compressed and/or delayed stochastic gradient updates. This is particularly important for distributed parallel implementations, where asynchronous and communication-efficient methods are key to achieving linear speedups for optimization with multiple devices.
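For concreteness, the following is a minimal NumPy sketch (not the paper's reference implementation) of the delayed-update scheme the abstract refers to: at step $t$ the iterate is updated with a stochastic gradient evaluated at the iterate from $\tau$ steps earlier. The quadratic objective, step size \texttt{lr}, delay \texttt{tau}, and noise level \texttt{noise\_std} are illustrative assumptions, not values taken from the analysis.

\begin{verbatim}
# Minimal sketch of SGD with a fixed gradient delay tau on the
# quadratic objective f(x) = 0.5 * ||A x - b||^2 (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
d = 10
A = rng.standard_normal((d, d)) / np.sqrt(d)
b = rng.standard_normal(d)

def stochastic_grad(x, noise_std=0.1):
    """Gradient of 0.5*||Ax - b||^2 plus additive Gaussian noise."""
    return A.T @ (A @ x - b) + noise_std * rng.standard_normal(d)

def delayed_sgd(T=2000, tau=10, lr=0.05):
    """SGD where the update at step t uses a stochastic gradient
    evaluated at the iterate from tau steps earlier (x_{t - tau})."""
    x = np.zeros(d)
    history = [x.copy()]   # past iterates, used to look up x_{t - tau}
    for t in range(T):
        x_delayed = history[max(0, t - tau)]
        x = x - lr * stochastic_grad(x_delayed)
        history.append(x.copy())
    return x

x_T = delayed_sgd()
print("final objective:", 0.5 * np.linalg.norm(A @ x_T - b) ** 2)
\end{verbatim}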