Finding parameters of a deep neural network (NN) that fit training data is a nonconvex optimization problem, yet a basic first-order optimization method, gradient descent, finds a global solution with perfect fit (zero training loss) in many practical situations. We examine this phenomenon for Residual Neural Networks (ResNets) with smooth activation functions in a limiting regime in which both the number of layers (depth) and the number of neurons in each layer (width) go to infinity. First, we use a mean-field-limit argument to prove that, in this large-NN limit, gradient descent on the parameters becomes a partial differential equation (PDE) that characterizes a gradient flow for a probability distribution. Next, we show that the solution to this PDE converges, as training time goes to infinity, to a zero-loss solution. Together, these results imply that training the ResNet by gradient descent also yields near-zero loss, provided the ResNet is large enough. We give estimates of the depth and width needed to reduce the loss below a given threshold, with high probability.
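As a rough, hedged sketch of the setting described above (the notation $X$, $f$, $\rho$, and $L$ is illustrative and assumed, not taken from the abstract), a standard mean-field formulation writes the infinitely deep, infinitely wide ResNet as an ODE in the depth variable $t \in [0,1]$,
\[
\frac{d}{dt} X(t) \;=\; \int f\bigl(X(t), \theta\bigr)\, \rho(\theta, t)\, d\theta ,
\]
where $\rho(\cdot, t)$ is the distribution of neuron parameters at depth $t$ and $f$ is built from the smooth activation. Gradient-descent training of the parameters then corresponds, in the large-NN limit, to a gradient-flow PDE in the training-time variable $s$,
\[
\partial_s \rho(\theta, t; s) \;=\; \nabla_\theta \cdot \Bigl( \rho(\theta, t; s)\, \nabla_\theta \frac{\delta L}{\delta \rho}(\theta, t; s) \Bigr),
\]
so that the loss $L(\rho(\cdot,\cdot\,; s))$ decreases along $s$; the zero-loss statement corresponds to $L \to 0$ as $s \to \infty$.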