Finding parameters in a deep neural network (NN) that fit training data is a nonconvex optimization problem, but a basic first-order optimization method (gradient descent) finds a global optimizer with perfect fit (zero loss) in many practical situations. We examine this phenomenon for the case of Residual Neural Networks (ResNet) with smooth activation functions in a limiting regime in which both the number of layers (depth) and the number of weights in each layer (width) go to infinity. First, we use a mean-field-limit argument to prove that, in the large-NN limit, gradient descent for parameter training becomes a gradient flow for a probability distribution characterized by a partial differential equation (PDE). Next, we show that, under certain assumptions, the solution to the PDE converges, as training time goes to infinity, to a zero-loss solution. Together, these results suggest that training the ResNet gives a near-zero loss if the ResNet is large enough. We give estimates of the depth and width needed to reduce the loss below a given threshold, with high probability.
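To make the setting concrete, the following is a minimal sketch (not the paper's exact formulation) of a deep, wide ResNet with a smooth activation trained by plain gradient descent. It assumes a residual update of the form z ← z + (1/(L·M)) Σ_m tanh(W_m z + b_m) with depth L and per-layer width M, a linear readout, a toy regression dataset, and learning rates for the residual weights scaled by L·M so that the weight distribution evolves at O(1) speed (a common mean-field convention); all names, sizes, scalings, and step sizes are illustrative assumptions.

```python
import jax
import jax.numpy as jnp

# Illustrative sketch only: a depth-L ResNet whose residual block at each layer
# averages M "units", z <- z + (1/(L*M)) * sum_m tanh(W_m z + b_m), trained by
# full-batch gradient descent on toy data. Sizes/scalings are assumptions.

L, M, D = 32, 64, 2                      # depth, width per layer, data dimension

def init_params(key):
    kW, kb, ka = jax.random.split(key, 3)
    return dict(
        W=jax.random.normal(kW, (L, M, D, D)),   # per-layer, per-unit weights
        b=jax.random.normal(kb, (L, M, D)),      # per-layer, per-unit biases
        a=jax.random.normal(ka, (D,)),           # linear readout
    )

def forward(params, x):
    z = x
    for l in range(L):
        # mean over the M units and 1/L depth scaling (forward-Euler flavour)
        u = jnp.tanh(jnp.einsum('mij,j->mi', params['W'][l], z) + params['b'][l])
        z = z + u.mean(axis=0) / L
    return jnp.dot(params['a'], z)

def loss(params, X, y):
    preds = jax.vmap(lambda x: forward(params, x))(X)
    return jnp.mean((preds - y) ** 2)

key = jax.random.PRNGKey(0)
kx, kp = jax.random.split(key)
X = jax.random.normal(kx, (16, D))               # toy training inputs
y = jnp.sin(X[:, 0]) * jnp.cos(X[:, 1])          # hypothetical target function

params = init_params(kp)
grad_loss = jax.jit(jax.grad(loss))
lr_out, lr_res = 0.05, 0.05 * L * M              # residual steps scaled by L*M so the
                                                 # weight distribution moves at O(1)
                                                 # speed (assumed mean-field convention)
for step in range(3000):
    g = grad_loss(params, X, y)
    params = dict(
        W=params['W'] - lr_res * g['W'],
        b=params['b'] - lr_res * g['b'],
        a=params['a'] - lr_out * g['a'],
    )

# With large enough L and M and suitably chosen step sizes, the training loss is
# expected to approach zero; the exact trajectory depends on these choices.
print('final training loss:', float(loss(params, X, y)))
```

In this parameterization, the per-layer, per-unit weights play the role of samples from the probability distribution whose gradient-flow PDE the abstract refers to; as L and M grow, the discrete update above approximates that continuum dynamics.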