In overparametrized models, the noise in stochastic gradient descent (SGD) implicitly regularizes the optimization trajectory and determines which local minimum SGD converges to. Motivated by empirical studies that demonstrate that training with noisy labels improves generalization, we study the implicit regularization effect of SGD with label noise. We show that SGD with label noise converges to a stationary point of a regularized loss $L(\theta) +\lambda R(\theta)$, where $L(\theta)$ is the training loss, $\lambda$ is an effective regularization parameter depending on the step size, strength of the label noise, and the batch size, and $R(\theta)$ is an explicit regularizer that penalizes sharp minimizers. Our analysis uncovers an additional regularization effect of large learning rates beyond the linear scaling rule that penalizes large eigenvalues of the Hessian more than small ones. We also prove extensions to classification with general loss functions, SGD with momentum, and SGD with general noise covariance, significantly strengthening the prior work of Blanc et al. to global convergence and large learning rates and of HaoChen et al. to general models.
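To make the setup concrete, the sketch below illustrates label-noise SGD on a toy overparametrized least-squares problem: at every step the minibatch labels are perturbed by fresh Gaussian noise before the gradient is computed. This is a minimal illustration, not code from the paper; the function name `label_noise_sgd` and the default values of the step size `lr`, noise strength `sigma`, and `batch_size` are illustrative assumptions. Per the abstract, the effective regularization parameter $\lambda$ of the implicit regularizer $R(\theta)$ depends on the step size, the label-noise strength, and the batch size; the exact dependence is established in the paper, not reproduced here.

```python
# Minimal sketch (assumed setup, not the paper's code) of SGD with label noise
# on least-squares regression with an overparametrized linear model (d > n).
import numpy as np


def label_noise_sgd(X, y, lr=0.05, sigma=0.5, batch_size=25, steps=20_000, seed=0):
    """Run SGD on the squared loss, resampling Gaussian label noise at every step."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(steps):
        idx = rng.choice(n, size=batch_size, replace=False)
        # Fresh label noise each step: y_tilde = y + sigma * xi, xi ~ N(0, I).
        y_tilde = y[idx] + sigma * rng.standard_normal(batch_size)
        residual = X[idx] @ theta - y_tilde
        grad = X[idx].T @ residual / batch_size
        theta -= lr * grad
    return theta


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    n, d = 100, 200                      # overparametrized: many interpolating solutions
    X = rng.standard_normal((n, d))
    theta_star = rng.standard_normal(d)
    y = X @ theta_star                   # clean labels; noise is injected only during training
    theta_hat = label_noise_sgd(X, y)
    clean_loss = 0.5 * np.mean((X @ theta_hat - y) ** 2)
    print("training loss on clean labels:", clean_loss)
```

Because fresh noise is resampled at every step, the iterates do not settle at an arbitrary interpolating solution but hover near the zero-loss manifold, which is the regime in which the abstract's regularized loss $L(\theta) + \lambda R(\theta)$ characterizes the stationary point SGD selects.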