The empirical success of deep learning is often attributed to SGD's mysterious ability to avoid sharp local minima in the loss landscape, as sharp minima are known to lead to poor generalization. Recently, empirical evidence of heavy-tailed gradient noise was reported in many deep learning tasks, and it was shown in \c{S}im\c{s}ekli (2019a,b) that SGD can escape sharp local minima in the presence of such heavy-tailed gradient noise, providing a partial solution to the mystery. In this work, we analyze a popular variant of SGD where gradients are truncated above a fixed threshold. We show that it achieves a stronger notion of avoiding sharp minima: it can effectively eliminate sharp local minima entirely from its training trajectory. We characterize the dynamics of truncated SGD driven by heavy-tailed noise. First, we show that the truncation threshold and the width of the attraction field dictate the order of the first exit time from the associated local minimum. Moreover, when the objective function satisfies appropriate structural conditions, we prove that as the learning rate decreases, the dynamics of heavy-tailed truncated SGD closely resemble those of a continuous-time Markov chain that never visits any sharp minima. Real-data experiments on deep learning confirm our theoretical prediction that heavy-tailed SGD with gradient clipping finds flatter local minima and achieves better generalization.
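The truncated-SGD variant analyzed above can be sketched in a few lines. This is a minimal illustration, not the paper's experimental setup: the parameter names, the norm-clipping rule, and the symmetric Pareto noise (a convenient stand-in for heavy-tailed, e.g. alpha-stable, gradient noise) are all assumptions for the example.

```python
import math
import random

random.seed(0)

def clipped_sgd_step(x, grad_fn, lr=0.01, threshold=1.0, tail_index=1.5):
    """One SGD step with gradient truncation under heavy-tailed noise.
    A minimal sketch; names and the noise model are illustrative."""
    # Symmetric Pareto noise: tail index < 2 gives infinite variance,
    # mimicking the heavy-tailed gradient-noise regime.
    noise = [random.paretovariate(tail_index) * random.choice([-1.0, 1.0])
             for _ in x]
    g = [gi + ni for gi, ni in zip(grad_fn(x), noise)]
    # Truncation: rescale the stochastic gradient whenever its norm
    # exceeds the fixed threshold (gradient clipping).
    norm = math.sqrt(sum(gi * gi for gi in g))
    if norm > threshold:
        g = [gi * threshold / norm for gi in g]
    return [xi - lr * gi for xi, gi in zip(x, g)]

# Usage: descend f(x) = 0.5 * ||x||^2, whose gradient is x itself.
x = [5.0, -3.0]
for _ in range(2000):
    x = clipped_sgd_step(x, lambda z: z, lr=0.05, threshold=1.0)
```

Note that after truncation each update has norm at most lr * threshold, which is what bounds the size of any single jump and drives the exit-time analysis.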