Recent studies have shown that heavy tails can emerge in stochastic optimization and that the heaviness of the tails has links to the generalization error. While these studies have shed light on interesting aspects of the generalization behavior in modern settings, they relied on strong topological and statistical regularity assumptions, which are hard to verify in practice. Furthermore, it has been empirically illustrated that the relation between heavy tails and generalization might not always be monotonic in practice, contrary to the conclusions of existing theory. In this study, we establish novel links between the tail behavior and generalization properties of stochastic gradient descent (SGD), through the lens of algorithmic stability. We consider a quadratic optimization problem and use a heavy-tailed stochastic differential equation (and its Euler discretization) as a proxy for modeling the heavy-tailed behavior emerging in SGD. We then prove uniform stability bounds, which reveal the following outcomes: (i) Without making any exotic assumptions, we show that SGD will not be stable if stability is measured with the squared loss $x\mapsto x^2$, whereas it becomes stable if stability is instead measured with a surrogate loss $x\mapsto |x|^p$ for some $p<2$. (ii) Depending on the variance of the data, there exists a \emph{`threshold of heavy-tailedness'} such that the generalization error decreases as the tails become heavier, as long as the tails are lighter than this threshold. This suggests that the relation between heavy tails and generalization is not globally monotonic. (iii) We prove matching lower bounds on uniform stability, implying that our bounds are tight in terms of the heaviness of the tails. We support our theory with synthetic and real neural network experiments.
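To make the modeling proxy concrete, the following is a minimal sketch of the Euler discretization of a heavy-tailed SDE on a one-dimensional quadratic objective, $\mathrm{d}X_t = -a X_t\,\mathrm{d}t + \sigma\,\mathrm{d}L_t^{\alpha}$, where $L_t^{\alpha}$ is a symmetric $\alpha$-stable Lévy process. All function names and parameter choices here are illustrative, not the paper's notation; the stable noise is drawn with the standard Chambers-Mallows-Stuck method, and the increment over a step of length $\eta$ is scaled by $\eta^{1/\alpha}$ per the self-similarity of the driving process.

```python
import numpy as np

def sample_alpha_stable(alpha, size, rng):
    """Symmetric alpha-stable samples via the Chambers-Mallows-Stuck method.

    For alpha = 2 this reduces to a Gaussian with variance 2.
    """
    u = rng.uniform(-np.pi / 2, np.pi / 2, size)
    w = rng.exponential(1.0, size)
    return (np.sin(alpha * u) / np.cos(u) ** (1.0 / alpha)
            * (np.cos((1.0 - alpha) * u) / w) ** ((1.0 - alpha) / alpha))

def heavy_tailed_sgd(a, eta, sigma, alpha, n_steps, x0, rng):
    """Euler scheme for dX_t = -a X_t dt + sigma dL_t^alpha (quadratic loss a*x^2/2).

    The stable increment over a step of length eta is scaled by eta**(1/alpha),
    mimicking heavy-tailed SGD iterates; smaller alpha means heavier tails.
    """
    noise = sample_alpha_stable(alpha, n_steps, rng)
    x = np.empty(n_steps + 1)
    x[0] = x0
    for k in range(n_steps):
        # Gradient step on the quadratic plus a heavy-tailed perturbation.
        x[k + 1] = x[k] - eta * a * x[k] + sigma * eta ** (1.0 / alpha) * noise[k]
    return x

rng = np.random.default_rng(0)
trajectory = heavy_tailed_sgd(a=1.0, eta=0.1, sigma=0.5, alpha=1.5,
                              n_steps=1000, x0=5.0, rng=rng)
```

Under this scheme, varying `alpha` below 2 lets one probe how tail heaviness affects the dispersion of the iterates, which is the regime the stability analysis concerns.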