We investigate the generalization and optimization properties of shallow neural-network classifiers trained by gradient descent in the interpolating regime. Specifically, in a realizable scenario where model weights can achieve arbitrarily small training error $\epsilon$ and their distance from initialization is $g(\epsilon)$, we demonstrate that gradient descent with $n$ training examples achieves training error $O(g(1/T)^2 /T)$ and generalization error $O(g(1/T)^2 /n)$ at iteration $T$, provided there are at least $m=\Omega(g(1/T)^4)$ hidden neurons. We then show that our realizable setting encompasses a special case where the data are separable by the model's neural tangent kernel. For this setting and logistic-loss minimization, we prove the training loss decays at a rate of $\tilde O(1/ T)$ given a polylogarithmic number of neurons $m=\Omega(\log^4 (T))$. Moreover, with $m=\Omega(\log^{4} (n))$ neurons and $T\approx n$ iterations, we bound the test loss by $\tilde{O}(1/n)$. Our results differ from existing generalization outcomes using the algorithmic-stability framework, which necessitate polynomial width and yield suboptimal generalization rates. Central to our analysis is the use of a new self-bounded weak-convexity property, which leads to a generalized local quasi-convexity property for sufficiently parameterized neural-network classifiers. Ultimately, despite the objective's non-convexity, this leads to convergence and generalization-gap bounds that resemble those found in the convex setting of linear logistic regression.
Title: Generalization and Stability of Interpolating Neural Networks with Minimal Width
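To make the setting concrete, below is a minimal sketch (not the paper's code) of the training setup the abstract describes: a shallow two-layer network of width $m$ trained by gradient descent on the logistic loss, while tracking the Frobenius distance of the weights from their initialization, which plays the role of $g(\epsilon)$. All names (`m`, `T`, `step_size`, the tanh activation, fixed random output signs) are illustrative assumptions, not choices taken from the paper.

```python
# Minimal sketch of the abstract's setting, under illustrative assumptions:
# f(x) = (1/sqrt(m)) * sum_j a_j * tanh(w_j^T x), with fixed random output
# signs a_j in {-1,+1} and only the hidden-layer weights W trained.
import numpy as np

rng = np.random.default_rng(0)

# Toy binary-classification data with labels in {-1, +1}.
n, d = 200, 10
X = rng.standard_normal((n, d))
y = np.sign(X @ rng.standard_normal(d))
y[y == 0] = 1.0

# Shallow network: width m, fixed output signs a, trainable hidden weights W.
m = 512
a = rng.choice([-1.0, 1.0], size=m)
W0 = rng.standard_normal((m, d))            # initialization
W = W0.copy()

def forward(W, X):
    # Network output, shape (n,).
    return (np.tanh(X @ W.T) @ a) / np.sqrt(m)

def logistic_loss(margins):
    # Average of log(1 + exp(-y_i f(x_i))).
    return np.mean(np.log1p(np.exp(-margins)))

step_size, T = 1.0, 2000
for t in range(T):
    hidden = X @ W.T                        # (n, m) pre-activations
    out = (np.tanh(hidden) @ a) / np.sqrt(m)
    margins = y * out
    # Derivative of the average logistic loss w.r.t. each margin.
    dmargin = -1.0 / (1.0 + np.exp(margins)) / n
    # Backpropagate through the output layer and the tanh activation.
    grad_out = (dmargin * y)[:, None] * (a / np.sqrt(m))[None, :]   # (n, m)
    grad_W = (grad_out * (1.0 - np.tanh(hidden) ** 2)).T @ X        # (m, d)
    W -= step_size * grad_W

train_loss = logistic_loss(y * forward(W, X))
dist_from_init = np.linalg.norm(W - W0)     # plays the role of g(eps)
print(f"train loss {train_loss:.4f}, ||W - W0||_F = {dist_from_init:.3f}")
```

In this sketch one would expect the training loss to decay while `||W - W0||_F` stays modest when the width `m` is large enough, mirroring (informally) the abstract's claim that near-interpolation is reachable within a bounded distance $g(\epsilon)$ of the initialization.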