Stochastic gradient descent (SGD) is a pillar of modern machine learning, serving as the go-to optimization algorithm for a diverse array of problems. While the empirical success of SGD is often attributed to its computational efficiency and favorable generalization behavior, neither effect is well understood and disentangling them remains an open problem. Even in the simple setting of convex quadratic problems, worst-case analyses give an asymptotic convergence rate for SGD that is no better than full-batch gradient descent (GD), and the purported implicit regularization effects of SGD lack a precise explanation. In this work, we study the dynamics of multi-pass SGD on high-dimensional convex quadratics and establish an asymptotic equivalence to a stochastic differential equation, which we call homogenized stochastic gradient descent (HSGD), whose solutions we characterize explicitly in terms of a Volterra integral equation. These results yield precise formulas for the learning and risk trajectories, which reveal a mechanism of implicit conditioning that explains the efficiency of SGD relative to GD. We also prove that the noise from SGD negatively impacts generalization performance, ruling out the possibility of any type of implicit regularization in this context. Finally, we show how to adapt the HSGD formalism to include streaming SGD, which allows us to produce an exact prediction for the excess risk of multi-pass SGD relative to that of streaming SGD (bootstrap risk).
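Below is a minimal, hypothetical illustration of the setting described in the abstract: multi-pass (single-sample) SGD versus full-batch GD on a high-dimensional least-squares problem, i.e. a convex quadratic. The data model, problem sizes, and step sizes are illustrative choices and do not reproduce the paper's experimental protocol or the HSGD/Volterra predictions.

```python
import numpy as np

# Illustrative sketch (assumed setup, not the paper's experiments):
# multi-pass SGD vs. full-batch GD on a high-dimensional least-squares
# problem f(x) = ||Ax - b||^2 / (2n), which is a convex quadratic.

rng = np.random.default_rng(0)
n, d = 2000, 500                                  # samples, features
A = rng.standard_normal((n, d)) / np.sqrt(d)      # rows a_i with ||a_i||^2 ~ 1
x_star = rng.standard_normal(d)
b = A @ x_star + 0.1 * rng.standard_normal(n)     # noisy labels

def risk(x):
    """Empirical risk f(x) = ||Ax - b||^2 / (2n)."""
    return 0.5 * np.mean((A @ x - b) ** 2)

def sgd_multipass(passes=20, lr=0.5):
    """Single-sample SGD, one shuffled epoch per pass; the per-sample
    gradient (a_i^T x - b_i) a_i is an unbiased estimate of grad f."""
    x = np.zeros(d)
    hist = [risk(x)]
    for _ in range(passes):
        for i in rng.permutation(n):
            x -= lr * (A[i] @ x - b[i]) * A[i]
        hist.append(risk(x))
    return np.array(hist)

def gd(steps=20):
    """Full-batch GD with step size 1 / lambda_max of the Hessian A^T A / n."""
    H = A.T @ A / n
    lr = 1.0 / np.linalg.eigvalsh(H)[-1]
    x = np.zeros(d)
    hist = [risk(x)]
    for _ in range(steps):
        x -= lr * (A.T @ (A @ x - b)) / n
        hist.append(risk(x))
    return np.array(hist)

print("multi-pass SGD risk per pass:", np.round(sgd_multipass(), 4))
print("full-batch GD risk per step :", np.round(gd(), 4))
```

One SGD pass and one GD step each touch every sample once, so comparing the risk per pass against the risk per step gives a roughly compute-matched view of the efficiency gap discussed in the abstract.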