Recent studies have shown that heavy tails can emerge in stochastic optimization and that the heaviness of the tails is linked to the generalization error. While these studies have shed light on interesting aspects of the generalization behavior in modern settings, they relied on strong topological and statistical regularity assumptions, which are hard to verify in practice. Furthermore, it has been empirically illustrated that the relation between heavy tails and generalization might not always be monotonic in practice, contrary to the conclusions of existing theory. In this study, we establish novel links between the tail behavior and generalization properties of stochastic gradient descent (SGD), through the lens of algorithmic stability. We consider a quadratic optimization problem and use a heavy-tailed stochastic differential equation as a proxy for modeling the heavy-tailed behavior emerging in SGD. We then prove uniform stability bounds, which reveal the following outcomes: (i) Without making any exotic assumptions, we show that SGD will not be stable if stability is measured with the squared loss $x\mapsto x^2$, whereas it becomes stable if stability is instead measured with a surrogate loss $x\mapsto |x|^p$ for some $p<2$. (ii) Depending on the variance of the data, there exists a \emph{`threshold of heavy-tailedness'} such that the generalization error decreases as the tails become heavier, as long as the tails are lighter than this threshold. This suggests that the relation between heavy tails and generalization is not globally monotonic. (iii) We prove matching lower bounds on uniform stability, implying that our bounds are tight in terms of the heaviness of the tails. We support our theory with synthetic and real neural network experiments.
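The phenomenon behind outcome (i) can be illustrated with a minimal simulation, not taken from the paper: SGD on a one-dimensional quadratic with random curvature follows a Kesten-type recursion $x_{k+1} = (1-\eta a_k)x_k + \eta b_k$, whose stationary distribution is heavy-tailed even though each step uses light-tailed (Gaussian) data. The step size `eta` and the data distributions below are illustrative assumptions chosen so that the tail index falls below 2, in which case the empirical second moment (the squared loss) is unstable while a lower moment $|x|^p$, $p<2$, remains controlled.

```python
import numpy as np

rng = np.random.default_rng(0)

# SGD on a 1-D quadratic f(x) = (a/2) x^2 - b x with random data (a_k, b_k).
# The multiplicative factor (1 - eta * a_k) yields a Kesten-type recursion,
# which is known to produce heavy-tailed stationary iterates.
eta = 0.9                                  # step size (illustrative choice)
n_steps = 200_000
a = rng.normal(1.0, 1.5, size=n_steps)     # random curvature samples
b = rng.normal(0.0, 1.0, size=n_steps)     # random linear-term samples

x = 0.0
xs = np.empty(n_steps)
for k in range(n_steps):
    x = (1.0 - eta * a[k]) * x + eta * b[k]
    xs[k] = x

tail = np.abs(xs[n_steps // 2:])           # discard the first half as burn-in

# Empirical moments: with a tail index below 2, the second moment is
# dominated by rare extreme iterates, while a fractional moment stays small.
print("empirical E|x|^2  :", np.mean(tail ** 2))
print("empirical E|x|^0.5:", np.mean(tail ** 0.5))
```

Rerunning with a different seed shows the estimate of $\mathbb{E}|x|^2$ fluctuating wildly across runs while the $p=0.5$ moment stays stable, which is the practical symptom of measuring stability with the "wrong" loss.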