Heavy-tail phenomena in stochastic gradient descent (SGD) have been reported in several empirical studies. Experimental evidence in previous works suggests a strong interplay between the heaviness of the tails and the generalization behavior of SGD. To address these empirical phenomena theoretically, several works have made strong topological and statistical assumptions to link the generalization error to heavy tails. Very recently, new generalization bounds have been proven, indicating a non-monotonic relationship between the generalization error and heavy tails, which is more pertinent to the reported empirical observations. While these bounds do not require additional topological assumptions given that SGD can be modeled using a heavy-tailed stochastic differential equation (SDE), they apply only to simple quadratic problems. In this paper, we build on this line of research and develop generalization bounds for a more general class of objective functions, which includes non-convex functions as well. Our approach is based on developing Wasserstein stability bounds for heavy-tailed SDEs and their discretizations, which we then convert to generalization bounds. Our results do not require any nontrivial assumptions; yet, they shed more light on the empirical observations, thanks to the generality of the loss functions.
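As an illustrative sketch of the modeling viewpoint referenced above (the standard form used in this literature, not necessarily the exact formulation of this paper), SGD with step size $\eta$ on an objective $F$ is viewed as the Euler discretization of an SDE driven by an $\alpha$-stable Lévy process $L^{\alpha}_t$; here $\sigma$, $\eta$, and the noise variables $X_k$ are illustration symbols:

\[
\mathrm{d}\theta_t = -\nabla F(\theta_t)\,\mathrm{d}t + \sigma\,\mathrm{d}L^{\alpha}_t,
\qquad
\theta_{k+1} = \theta_k - \eta\,\nabla F(\theta_k) + \sigma\,\eta^{1/\alpha} X_k,
\]

where the $X_k$ are i.i.d. $\alpha$-stable random variables, smaller $\alpha \in (1, 2]$ corresponds to heavier tails, and $\alpha = 2$ recovers the Gaussian (Brownian) case. Under this model, Wasserstein stability refers to bounding the Wasserstein distance between the laws of $\theta$ trained on two datasets differing in a single sample, which standard stability arguments then convert into a generalization bound.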