Stochastic gradient descent (SGD) is subject to complicated multiplicative noise when applied to the mean-square loss. We use this property of SGD noise to derive a stochastic differential equation (SDE) with simpler additive noise by performing a random time change. Using this formalism, we show that the log loss barrier $\Delta\log L=\log[L(\theta^s)/L(\theta^*)]$ between a local minimum $\theta^*$ and a saddle $\theta^s$ determines the escape rate of SGD from the local minimum, in contrast to previous results, borrowed from physics, in which the linear loss barrier $\Delta L=L(\theta^s)-L(\theta^*)$ governs the escape rate. Our escape-rate formula depends strongly on the typical magnitude $h^*$ and the number $n$ of the outlier eigenvalues of the Hessian. This result explains the empirical fact that SGD prefers flat minima with low effective dimensions, giving insight into the implicit biases of SGD.
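As a minimal sketch of why the random time change makes the log loss barrier appear, consider (for illustration only, and not as the paper's full multidimensional derivation) a one-dimensional continuous-time approximation in which the SGD noise variance is proportional to the loss, with $\epsilon$ a hypothetical noise-strength parameter standing in for the learning-rate and batch-size dependence:
$$d\theta_t = -\nabla L(\theta_t)\,dt + \sqrt{2\epsilon\, L(\theta_t)}\,dW_t .$$
Introducing the random time $\tau$ defined by $d\tau = L(\theta_t)\,dt$ divides the drift by $L$ and rescales the Brownian increments by $1/\sqrt{L}$, yielding an SDE with additive noise,
$$d\theta_\tau = -\nabla\log L(\theta_\tau)\,d\tau + \sqrt{2\epsilon}\,dW_\tau ,$$
whose effective potential is $\log L$. A Kramers-type escape argument in the new time is then controlled by the barrier of this effective potential, $\Delta\log L = \log[L(\theta^s)/L(\theta^*)]$, rather than by the linear barrier $\Delta L$.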