The phenomenon that stochastic gradient descent (SGD) favors flat minima has played a critical role in understanding the implicit regularization of SGD. In this paper, we provide an explanation of this striking phenomenon by relating the particular noise structure of SGD to its \emph{linear stability} (Wu et al., 2018). Specifically, we consider training over-parameterized models with square loss. We prove that if a global minimum $\theta^*$ is linearly stable for SGD, then it must satisfy $\|H(\theta^*)\|_F\leq O(\sqrt{B}/\eta)$, where $\|H(\theta^*)\|_F, B, \eta$ denote the Frobenius norm of the Hessian at $\theta^*$, the batch size, and the learning rate, respectively. Otherwise, SGD will escape from that minimum \emph{exponentially} fast. Hence, for minima accessible to SGD, the sharpness -- as measured by the Frobenius norm of the Hessian -- is bounded \emph{independently} of the model size and sample size. The key to obtaining these results is exploiting the particular structure of SGD noise: the noise concentrates in sharp directions of the local landscape, and its magnitude is proportional to the loss value. This alignment property of SGD noise provably holds for linear networks and random feature models (RFMs), and is empirically verified for nonlinear networks. Moreover, the validity and practical relevance of our theoretical findings are justified by extensive experiments on the CIFAR-10 dataset.
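The stability criterion stated above can be illustrated numerically. The following is a minimal sketch, not the paper's code: it uses an assumed toy over-parameterized model $f(x; u, v) = uvx$ with square loss on noise-free data, so that every global minimum satisfies $uv = 1$, with $u = v = 1$ giving a flat minimum and large $u$ a sharp one. All parameter values ($n$, $B$, $\eta$, the perturbation size) are chosen for illustration only; the comparison of $\|H(\theta^*)\|_F$ against $\sqrt{B}/\eta$ follows the bound stated in the abstract.

\begin{verbatim}
# Minimal sketch (assumed toy setup, not the paper's experiments): compare the
# Hessian Frobenius norm at a flat vs. a sharp global minimum of the model
# f(x; u, v) = u*v*x with the stability threshold sqrt(B)/eta, and check
# whether SGD stays near the minimum or escapes.
import numpy as np

rng = np.random.default_rng(0)
n, B, eta, steps = 64, 4, 0.05, 2000    # samples, batch size, learning rate, SGD steps
x = rng.normal(size=n)                  # inputs
y = x.copy()                            # noise-free targets, interpolated by any u*v = 1

def loss(u, v):
    r = u * v * x - y
    return 0.5 * np.mean(r ** 2)

def hessian_frobenius_at_minimum(u, v):
    # At an interpolating minimum (u*v = 1) the Hessian of the full-batch loss is
    # H = mean(x_i^2) * [[v^2, u*v], [u*v, u^2]].
    c = np.mean(x ** 2)
    H = c * np.array([[v ** 2, u * v], [u * v, u ** 2]])
    return np.linalg.norm(H, "fro")

def run_sgd(u, v):
    u = u + 1e-3                         # small perturbation off the global minimum
    for _ in range(steps):
        idx = rng.choice(n, size=B, replace=False)
        r = u * v * x[idx] - y[idx]
        g = np.mean(r * x[idx])          # d(batch loss)/d(u*v)
        u, v = u - eta * g * v, v - eta * g * u
        cur = loss(u, v)
        if not np.isfinite(cur) or cur > 1e3:
            return "escaped"
    return f"stayed (final loss {loss(u, v):.1e})"

print(f"stability threshold sqrt(B)/eta = {np.sqrt(B) / eta:.1f}")
for u0 in (1.0, 10.0):                   # flat minimum (u=v=1) vs. sharp minimum (u=10, v=0.1)
    v0 = 1.0 / u0
    frob = hessian_frobenius_at_minimum(u0, v0)
    print(f"u={u0:>4.1f}, v={v0:.2f}: ||H||_F = {frob:6.1f} -> SGD {run_sgd(u0, v0)}")
\end{verbatim}

On this toy problem one should observe that the flat minimum, whose Hessian Frobenius norm (about $2$) lies well below $\sqrt{B}/\eta = 40$, is retained by SGD, while the sharp minimum (Frobenius norm near $100$) violates the bound and is escaped within a few iterations, mirroring the exponentially fast escape described above.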