Despite being tremendously overparameterized, deep neural networks trained by stochastic gradient descent (SGD) are known to generalize surprisingly well. Building on the Rademacher complexity of a pre-specified hypothesis set, various norm-based generalization bounds have been developed to explain this phenomenon. However, recent studies suggest these bounds are problematic: they grow with the training set size, contrary to empirical evidence. In this work, we argue that the hypothesis set SGD actually explores is trajectory-dependent, and that its Rademacher complexity can therefore yield a tighter bound. To this end, we characterize the SGD recursion via a stochastic differential equation, assuming the incurred stochastic gradient noise follows a fractional Brownian motion. We then express the Rademacher complexity in terms of covering numbers and relate it to the Hausdorff dimension of the optimization trajectory. By invoking hypothesis set stability, we derive a novel generalization bound for deep neural networks. Extensive experiments demonstrate that the bound predicts the generalization gap well across several common experimental interventions. We further show that the Hurst parameter of the fractional Brownian motion is more informative than existing generalization indicators such as the power-law index and the upper Blumenthal-Getoor index.
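To make the central quantity concrete, the sketch below simulates a fractional Brownian motion path and recovers its Hurst parameter from the increment-variance scaling Var(X_{t+τ} − X_t) ∝ τ^{2H}. This is a minimal illustration of fBm and Hurst estimation in general, not the estimator used in the paper; the function names and the Cholesky-based sampler are our own assumptions for the example.

```python
import numpy as np

def fbm_sample(n, hurst, rng):
    """Sample a fractional Brownian motion path of length n by taking the
    cumulative sum of fractional Gaussian noise, drawn via the Cholesky
    factor of its autocovariance matrix (hypothetical helper, illustration only)."""
    k = np.arange(n)
    # Autocovariance of unit-variance fGn: gamma(k) = 0.5(|k+1|^{2H} - 2|k|^{2H} + |k-1|^{2H})
    gamma = 0.5 * (np.abs(k + 1) ** (2 * hurst)
                   - 2 * np.abs(k) ** (2 * hurst)
                   + np.abs(k - 1) ** (2 * hurst))
    cov = gamma[np.abs(k[:, None] - k[None, :])]
    L = np.linalg.cholesky(cov + 1e-10 * np.eye(n))  # jitter for numerical stability
    fgn = L @ rng.standard_normal(n)
    return np.cumsum(fgn)

def estimate_hurst(path, lags=range(2, 32)):
    """Estimate H from the self-similarity of increments:
    Var(X_{t+tau} - X_t) scales as tau^{2H}, so a log-log fit gives 2H."""
    taus = np.array(list(lags))
    variances = np.array([np.var(path[tau:] - path[:-tau]) for tau in taus])
    slope, _ = np.polyfit(np.log(taus), np.log(variances), 1)
    return slope / 2

rng = np.random.default_rng(0)
path = fbm_sample(2048, hurst=0.7, rng=rng)
print(estimate_hurst(path))  # estimate should lie near the true H = 0.7
```

H = 0.5 recovers standard Brownian motion with independent increments; H > 0.5 gives positively correlated (persistent) increments, which is the heavy-tailed, long-memory regime that distinguishes the Hurst parameter from indicators tied purely to tail heaviness.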