We analyze the dynamics of large batch stochastic gradient descent with momentum (SGD+M) on the least squares problem when both the number of samples and dimensions are large. In this setting, we show that the dynamics of SGD+M converge to a deterministic discrete Volterra equation as dimension increases, which we analyze. We identify a stability measure, the implicit conditioning ratio (ICR), which governs the ability of SGD+M to accelerate. When the batch size exceeds this ICR, SGD+M converges linearly at a rate of $\mathcal{O}(1/\sqrt{\kappa})$, matching optimal full-batch momentum (in particular, performing as well as full-batch momentum while using only a fraction of the full batch size). For batch sizes smaller than the ICR, in contrast, SGD+M has rates that scale like a multiple of the single-batch SGD rate. We give explicit choices for the learning rate and momentum parameter, in terms of the Hessian spectrum, that achieve this performance.
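As a rough illustration of the setting described above, the sketch below runs mini-batch SGD with heavy-ball momentum on a synthetic least squares problem. The data, batch size, and the Polyak-style learning rate and momentum values (set from the extreme eigenvalues of the Hessian) are illustrative assumptions, not the paper's exact large-batch prescription or ICR-based tuning.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, batch_size = 2000, 500, 256          # samples, dimension, batch size
A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

# Objective: f(x) = (1/(2n)) ||A x - b||^2, with Hessian H = A^T A / n.
H = A.T @ A / n
eigs = np.linalg.eigvalsh(H)
lam_max, lam_min = eigs[-1], max(eigs[0], 1e-12)

# Classical heavy-ball parameters, written in terms of the extreme Hessian
# eigenvalues (an assumption; the paper derives its own explicit choices).
sqL, sqm = np.sqrt(lam_max), np.sqrt(lam_min)
gamma = 4.0 / (sqL + sqm) ** 2             # learning rate
beta = ((sqL - sqm) / (sqL + sqm)) ** 2    # momentum

x = np.zeros(d)
x_prev = np.zeros(d)
for k in range(1000):
    idx = rng.choice(n, size=batch_size, replace=False)    # draw a mini-batch
    grad = A[idx].T @ (A[idx] @ x - b[idx]) / batch_size   # unbiased gradient estimate
    x, x_prev = x - gamma * grad + beta * (x - x_prev), x  # SGD+M (heavy-ball) update

print("final objective:", 0.5 * np.mean((A @ x - b) ** 2))
```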