The existing analysis of asynchronous stochastic gradient descent (SGD) degrades dramatically when any delay is large, giving the impression that performance depends primarily on the delay. On the contrary, we prove much better guarantees for the same asynchronous SGD algorithm regardless of the delays in the gradients, depending instead just on the number of parallel devices used to implement the algorithm. Our guarantees are strictly better than the existing analyses, and we also argue that asynchronous SGD outperforms synchronous minibatch SGD in the settings we consider. For our analysis, we introduce a novel recursion based on "virtual iterates" and delay-adaptive stepsizes, which allow us to derive state-of-the-art guarantees for both convex and non-convex objectives.
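To make the setting concrete, below is a minimal simulation sketch of asynchronous SGD on a parameter server with a delay-adaptive stepsize. The specific stepsize rule (base_lr / max(1, delay)), the uniform worker-sampling model, and the noisy quadratic test objective are illustrative assumptions for this sketch, not the paper's exact algorithm, schedule, or analysis.

```python
import numpy as np

def async_sgd(grad, x0, num_workers=4, num_steps=200, base_lr=0.1, rng=None):
    """Simulate asynchronous SGD with a delay-adaptive stepsize.

    Each worker holds the model copy it last read. At every step one worker
    finishes, its (possibly stale) stochastic gradient is applied at the
    server, and the worker re-reads the current model. The stepsize is shrunk
    in proportion to the observed delay (an illustrative delay-adaptive rule).
    """
    rng = rng or np.random.default_rng(0)
    x = np.array(x0, dtype=float)
    # Every worker starts by reading the initial model at iteration 0.
    worker_model = [x.copy() for _ in range(num_workers)]
    worker_read_iter = [0] * num_workers

    for t in range(num_steps):
        w = rng.integers(num_workers)        # worker that finishes next
        delay = t - worker_read_iter[w]      # staleness of its gradient
        g = grad(worker_model[w], rng)       # stochastic gradient at the stale point
        lr = base_lr / max(1, delay)         # delay-adaptive stepsize (illustrative)
        x -= lr * g                          # server applies the update
        worker_model[w] = x.copy()           # worker re-reads the current model
        worker_read_iter[w] = t + 1
    return x

# Example usage: noisy quadratic objective f(x) = 0.5 * ||x||^2.
noisy_quad_grad = lambda x, rng: x + 0.1 * rng.standard_normal(x.shape)
print(async_sgd(noisy_quad_grad, x0=np.ones(5)))
```

Note that in this sketch the number of parallel devices (num_workers) bounds how many gradients can be in flight at once, which is the quantity the guarantees described above depend on, rather than the size of any individual delay.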