The existing analysis of asynchronous stochastic gradient descent (SGD) degrades dramatically when any delay is large, giving the impression that performance depends primarily on the delay. On the contrary, we prove much better guarantees for the same asynchronous SGD algorithm regardless of the delays in the gradients, depending instead just on the number of parallel devices used to implement the algorithm. Our guarantees are strictly better than the existing analyses, and we also argue that asynchronous SGD outperforms synchronous minibatch SGD in the settings we consider. For our analysis, we introduce a novel recursion based on "virtual iterates" and delay-adaptive stepsizes, which allow us to derive state-of-the-art guarantees for both convex and non-convex objectives.
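To make the setting concrete, here is a minimal toy simulation of asynchronous SGD with a delay-adaptive stepsize. The least-squares objective, the random choice of which worker finishes next, and the stepsize rule `base_lr / max(1, delay)` are illustrative assumptions for this sketch, not the paper's exact algorithm or analysis; the paper's virtual-iterate recursion and stepsize schedule are developed in the analysis sections.

```python
import numpy as np

# Illustrative sketch: asynchronous SGD with delayed gradients and a
# delay-adaptive stepsize on a toy least-squares problem (assumed setup).

rng = np.random.default_rng(0)
d, n_workers, n_steps = 10, 4, 2000
A = rng.standard_normal((100, d))
b = rng.standard_normal(100)

def stochastic_grad(x):
    """Gradient of 0.5*(a_i^T x - b_i)^2 for one randomly sampled data point."""
    i = rng.integers(len(b))
    return A[i] * (A[i] @ x - b[i])

x = np.zeros(d)
base_lr = 0.05
# Each worker stores the gradient it is computing and the iteration index at
# which it read the model; the difference on arrival is that gradient's delay.
in_flight = [(stochastic_grad(x), 0) for _ in range(n_workers)]

for t in range(1, n_steps + 1):
    # Some worker finishes; picked at random here to emulate asynchrony.
    w = rng.integers(n_workers)
    grad, read_at = in_flight[w]
    delay = t - read_at
    # Delay-adaptive stepsize: staler gradients take smaller steps (assumed rule).
    eta = base_lr / max(1, delay)
    x -= eta * grad
    # The worker immediately reads the current model and starts a new gradient.
    in_flight[w] = (stochastic_grad(x), t)

print("final objective:", 0.5 * np.mean((A @ x - b) ** 2))
```

Note that the server never waits for stragglers: each update uses whatever gradient arrives next, and only the stepsize reacts to how stale that gradient is, which is the behavior the guarantees above concern.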