We study the optimization of wide neural networks (NNs) via gradient flow (GF) in setups that allow feature learning while admitting non-asymptotic global convergence guarantees. First, for wide shallow NNs under the mean-field scaling and with a general class of activation functions, we prove that when the input dimension is at least the size of the training set, the training loss converges to zero at a linear rate under GF. Building upon this analysis, we study a model of wide multi-layer NNs whose second-to-last layer is trained via GF, for which we also prove linear-rate convergence of the training loss to zero, this time regardless of the input dimension. We also show empirically that, unlike in the Neural Tangent Kernel (NTK) regime, our multi-layer model exhibits feature learning and can achieve better generalization performance than its NTK counterpart.
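To make the shallow setting concrete, the following is a minimal illustrative sketch (not the paper's construction or proof): a wide two-layer network in the mean-field scaling, $f(x) = \frac{1}{m}\sum_{j} a_j\,\sigma(w_j^\top x)$, trained on squared loss with plain gradient descent as a discretization of gradient flow. The width `m`, step size, activation (`tanh`), and random data with input dimension `d` at least the number of samples `n` are all assumptions chosen for illustration; in the mean-field scaling the step size is taken proportional to `m` so that individual neurons move at an order-one rate.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 20, 32, 2000            # n samples, input dimension d >= n, width m (assumed values)
X = rng.standard_normal((n, d)) / np.sqrt(d)   # inputs with roughly unit norm
y = rng.standard_normal(n)

W = rng.standard_normal((m, d))   # first-layer weights
a = rng.standard_normal(m)        # second-layer weights

lr = 1.0 * m                      # mean-field time scaling: step size proportional to width
steps = 2000
for t in range(steps):
    H = np.tanh(X @ W.T)                      # (n, m) hidden activations
    r = H @ a / m - y                         # residuals under the 1/m (mean-field) scaling
    # gradients of (1/(2n)) * sum_i (f(x_i) - y_i)^2 w.r.t. a and W
    grad_a = H.T @ r / (m * n)
    grad_W = ((np.outer(r, a) * (1 - H ** 2)).T @ X) / (m * n)
    a -= lr * grad_a
    W -= lr * grad_W
    if t % 500 == 0:
        print(f"step {t:4d}  loss {0.5 * np.mean(r ** 2):.3e}")
```

Under this discretization the printed training loss decays geometrically in the step count, mirroring the linear-rate convergence of the continuous-time gradient flow described above.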