We prove linear convergence of gradient descent to a global minimum for the training of deep residual networks with constant layer width and smooth activation function. We further show that the trained weights, as a function of the layer index, admit a scaling limit which is H\"older continuous as the depth of the network tends to infinity. The proofs are based on non-asymptotic estimates of the loss function and of norms of the network weights along the gradient descent path. We illustrate the relevance of our theoretical results to practical settings using detailed numerical experiments on supervised learning problems.
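As a rough sketch of the setting (the paper's exact residual parametrization, scaling, and loss are not reproduced in this abstract, so the display below is an illustrative assumption only), a depth-$L$ residual network of constant width $d$ with smooth activation $\sigma$, trained by gradient descent on a squared loss, may be written as

% Hedged illustration: the 1/L scaling of the residual branch, the squared
% loss, and the step size \eta are assumptions, not taken from the abstract.
\begin{align*}
  h^{(0)}(x) &= x, \qquad
  h^{(\ell+1)}(x) = h^{(\ell)}(x) + \tfrac{1}{L}\,\sigma\!\big(W^{(\ell)} h^{(\ell)}(x)\big),
  \qquad W^{(\ell)} \in \mathbb{R}^{d \times d},\ \ \ell = 0,\dots,L-1,\\
  \mathcal{L}(W) &= \frac{1}{2n} \sum_{i=1}^{n} \big\| h^{(L)}(x_i) - y_i \big\|^2, \qquad
  W^{(\ell)}_{t+1} = W^{(\ell)}_t - \eta\, \nabla_{W^{(\ell)}} \mathcal{L}(W_t).
\end{align*}

In this notation, linear convergence means $\mathcal{L}(W_t) \le (1-c)^t\, \mathcal{L}(W_0)$ for some constant $c \in (0,1)$, and the scaling limit refers to viewing $\ell/L \in [0,1]$ as a continuous layer variable and showing that the trained weights $W^{(\lfloor uL \rfloor)}$ converge, as $L \to \infty$, to a H\"older continuous function of $u$.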