The observation that stochastic gradient descent (SGD) favors flat minima has played a fundamental role in understanding the implicit regularization of SGD and in guiding the tuning of hyperparameters. In this paper, we provide a quantitative explanation of this striking phenomenon by relating the particular noise structure of SGD to its \emph{linear stability} (Wu et al., 2018). Specifically, we consider training over-parameterized models with the square loss. We prove that if a global minimum $\theta^*$ is linearly stable for SGD, then it must satisfy $\|H(\theta^*)\|_F\leq O(\sqrt{B}/\eta)$, where $\|H(\theta^*)\|_F$, $B$, and $\eta$ denote the Frobenius norm of the Hessian at $\theta^*$, the batch size, and the learning rate, respectively. Otherwise, SGD will escape from that minimum \emph{exponentially} fast. Hence, for minima accessible to SGD, the sharpness -- as measured by the Frobenius norm of the Hessian -- is bounded independently of the model size and sample size. The key to obtaining these results is exploiting the particular geometry-aware structure of SGD noise: 1) the noise magnitude is proportional to the loss value; 2) the noise directions concentrate in the sharp directions of the local landscape. This property of SGD noise provably holds for linear networks and random feature models (RFMs) and is empirically verified for nonlinear networks. Moreover, the validity and practical relevance of our theoretical findings are justified by extensive numerical experiments.
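As a rough numerical illustration (not taken from the paper's experiments), the sketch below runs SGD on over-parameterized linear regression with the square loss, starting from a slightly perturbed interpolating minimum, and compares the Frobenius norm of the Hessian against the $\sqrt{B}/\eta$ scale: when the norm is far above that scale the iterates escape the minimum rapidly, and when it is well below they stay near it. The dataset sizes, the `sharpness` rescaling, the perturbation size, and the escape threshold are illustrative choices, and the constant hidden in the $O(\cdot)$ bound is ignored.

```python
# Minimal sketch: does SGD stay near a global minimum whose Hessian
# Frobenius norm is below sqrt(B)/eta, and escape one far above it?
import numpy as np

rng = np.random.default_rng(0)
n, d, B, eta, steps = 20, 100, 4, 0.1, 500   # samples, params, batch size, lr, SGD steps


def run_sgd_from_minimum(sharpness):
    """Perturb an interpolating minimum and track the training loss under SGD."""
    X = rng.standard_normal((n, d)) * sharpness         # rescaling controls the Hessian scale
    theta_star = rng.standard_normal(d)
    y = X @ theta_star                                   # exact interpolation: zero loss at theta_star
    H = X.T @ X / n                                      # Hessian of 0.5 * mean squared error
    theta = theta_star + 1e-6 * rng.standard_normal(d)   # tiny perturbation off the minimum
    for _ in range(steps):
        idx = rng.choice(n, size=B, replace=False)
        grad = X[idx].T @ (X[idx] @ theta - y[idx]) / B
        theta = theta - eta * grad
        loss = 0.5 * np.mean((X @ theta - y) ** 2)
        if loss > 1e3:                                   # escaped; stop before overflow
            break
    return np.linalg.norm(H, "fro"), loss


threshold = np.sqrt(B) / eta
for s in (0.5, 5.0):                                     # one flat minimum, one sharp minimum
    h_fro, final_loss = run_sgd_from_minimum(s)
    regime = "stable" if h_fro <= threshold else "expected to escape"
    print(f"||H||_F = {h_fro:7.2f}  (sqrt(B)/eta = {threshold:.1f}, {regime}):"
          f"  final loss = {final_loss:.3e}")
```

In this toy setup the flat case keeps the loss near zero while the sharp case blows up within a few steps, consistent with the stated necessary condition for linear stability; it is not evidence for the theorem, only a picture of what the bound asserts.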