We study the scaling limits of stochastic gradient descent (SGD) with constant step-size in the high-dimensional regime. We prove limit theorems for the trajectories of summary statistics (i.e., finite-dimensional functions) of SGD as the dimension goes to infinity. Our approach allows one to choose the summary statistics that are tracked, the initialization, and the step-size. It yields both ballistic (ODE) and diffusive (SDE) limits, with the limit depending dramatically on these choices. We show a critical scaling regime for the step-size, below which the effective ballistic dynamics matches gradient flow for the population loss, but at which a new correction term appears that changes the phase diagram. Around the fixed points of this effective dynamics, the corresponding diffusive limits can be quite complex and even degenerate. We demonstrate our approach on popular examples including estimation for spiked matrix and tensor models and classification via two-layer networks for binary and XOR-type Gaussian mixture models. These examples exhibit surprising phenomena including multimodal timescales to convergence as well as convergence to sub-optimal solutions with probability bounded away from zero from random (e.g., Gaussian) initializations. At the same time, we demonstrate the benefit of overparametrization by showing that the latter probability goes to zero as the second layer width grows.
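As a concrete illustration of the setting, the following is a minimal sketch (not the paper's construction) of online SGD with constant step-size for a spiked matrix model, tracking a one-dimensional summary statistic: the overlap of the iterate with the planted spike. All parameter values (dimension, signal strength `lam`, step-size `delta`) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

n = 200        # ambient dimension (illustrative)
lam = 4.0      # signal-to-noise ratio of the spike (assumed value)
delta = 0.1    # constant step-size (assumed value)
steps = 800

# planted spike direction v on the unit sphere
v = rng.standard_normal(n)
v /= np.linalg.norm(v)

# random (Gaussian) initialization: overlap with v is O(1/sqrt(n))
x = rng.standard_normal(n)
x /= np.linalg.norm(x)

overlaps = [abs(v @ x)]  # summary statistic m_t = |<v, x_t>|
for _ in range(steps):
    # fresh sample Y_t = lam * v v^T + W_t, with W_t a GOE-type noise matrix
    W = rng.standard_normal((n, n)) / np.sqrt(n)
    W = (W + W.T) / np.sqrt(2)
    # one spherical SGD (ascent) step on the sample objective x^T Y_t x / 2
    x = x + delta * (lam * (v @ x) * v + W @ x)
    x /= np.linalg.norm(x)
    overlaps.append(abs(v @ x))

print(f"initial overlap {overlaps[0]:.3f}, final overlap {overlaps[-1]:.3f}")
```

For large n, the trajectory of the overlap concentrates around a deterministic (ballistic) curve, which is the kind of limit the theorems above describe; the fluctuations around its fixed points are where the diffusive (SDE) limits arise.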