Overparameterization refers to the important phenomenon where the width of a neural network is chosen so that learning algorithms can provably attain zero loss in nonconvex training. The existing theory establishes such global convergence using various initialization strategies, training modifications, and width scalings. In particular, the state-of-the-art results require the width to scale quadratically with the number of training data under the standard initialization strategies used in practice for best generalization performance. In contrast, the most recent results obtain linear scaling either by requiring initializations that lead to "lazy training", or by training only a single layer. In this work, we provide an analytical framework that allows us to adopt standard initialization strategies, possibly avoid lazy training, and train all layers simultaneously in basic shallow neural networks while attaining a desirable subquadratic scaling for the network width. We achieve this desideratum via the Polyak-Lojasiewicz condition, smoothness, and standard assumptions on the data, using tools from random matrix theory.
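For reference, a standard statement of the Polyak-Lojasiewicz (PL) condition and the linear convergence it yields under smoothness is sketched below; the constants $\mu$ and $\beta$ are generic and are not tied to the specific bounds derived in this work. A loss $L$ with infimum $L^*$ satisfies the PL condition with constant $\mu > 0$ if, for all parameters $\theta$,
\[
  \tfrac{1}{2}\,\|\nabla L(\theta)\|^2 \;\ge\; \mu\bigl(L(\theta) - L^*\bigr).
\]
Combined with $\beta$-smoothness of $L$, gradient descent with step size $1/\beta$ then enjoys the classical linear convergence guarantee
\[
  L(\theta_t) - L^* \;\le\; \Bigl(1 - \tfrac{\mu}{\beta}\Bigr)^{t}\bigl(L(\theta_0) - L^*\bigr),
\]
which is the general mechanism by which PL-type conditions convert overparameterization into global convergence guarantees.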