On the existence of global minima and convergence analyses for gradient descent methods in the training of deep neural networks

In this article we study fully-connected feedforward deep ReLU artificial neural networks (ANNs) with an arbitrarily large number of hidden layers. We prove convergence of the risk of the gradient descent (GD) optimization method with random initializations in the training of such ANNs under three assumptions: that the unnormalized probability density function of the probability distribution of the input data of the considered supervised learning problem is piecewise polynomial, that the target function (describing the relationship between input data and output data) is piecewise polynomial, and that the risk function of the considered supervised learning problem admits at least one regular global minimum. In addition, in the special situation of shallow ANNs with just one hidden layer and one-dimensional input, we also verify this assumption by proving that, in the training of such shallow ANNs, for every Lipschitz continuous target function there exists a global minimum in the risk landscape. Finally, in the training of deep ANNs with ReLU activation, we also study solutions of gradient flow (GF) differential equations and prove that every non-divergent GF trajectory converges with a polynomial rate of convergence to a critical point (in the sense of limiting Fréchet subdifferentiability). Our mathematical convergence analysis builds on ideas from our previous article Eberle et al.; on tools from real algebraic geometry, such as the concept of semi-algebraic functions and generalized Kurdyka-Łojasiewicz inequalities; on tools from functional analysis, such as the Arzelà-Ascoli theorem; on tools from nonsmooth analysis, such as the concept of limiting Fréchet subgradients; and on the fact, revealed by Petersen et al., that the set of realization functions of shallow ReLU ANNs with fixed architecture forms a closed subset of the set of continuous functions.
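To make the setting concrete, the following is a minimal sketch of GD with random initializations for a shallow ReLU ANN with one-dimensional input. It is an illustration under stated assumptions, not the construction analyzed in the article: the target |x - 0.5|, the uniform input density on [0, 1] (approximated by a grid average), the width, the step size, the Gaussian initialization, and all function names are hypothetical choices, and taking the ReLU derivative at 0 to be 0 is likewise an assumption.

```python
import random

def relu(z):
    return z if z > 0.0 else 0.0

def target(x):
    # Hypothetical piecewise linear (hence piecewise polynomial, Lipschitz) target.
    return abs(x - 0.5)

def realization(theta, x):
    # Shallow ReLU ANN: x -> c + sum_j v_j * relu(w_j * x + b_j).
    w, b, v, c = theta
    return c + sum(vj * relu(wj * x + bj) for wj, bj, vj in zip(w, b, v))

def risk(theta, xs):
    # Grid average approximating the mean squared risk under a uniform input density.
    return sum((realization(theta, x) - target(x)) ** 2 for x in xs) / len(xs)

def generalized_gradient(theta, xs):
    # Gradient of the risk, with the ReLU derivative at 0 taken to be 0 (assumption).
    w, b, v, c = theta
    h = len(w)
    gw, gb, gv, gc = [0.0] * h, [0.0] * h, [0.0] * h, 0.0
    for x in xs:
        pre = [w[j] * x + b[j] for j in range(h)]
        out = c + sum(v[j] * relu(pre[j]) for j in range(h))
        err = 2.0 * (out - target(x)) / len(xs)
        gc += err
        for j in range(h):
            if pre[j] > 0.0:
                gw[j] += err * v[j] * x
                gb[j] += err * v[j]
                gv[j] += err * pre[j]
    return gw, gb, gv, gc

def gd_step(theta, grad, lr):
    # One plain gradient descent step on all parameters.
    w, b, v, c = theta
    gw, gb, gv, gc = grad
    return ([p - lr * g for p, g in zip(w, gw)],
            [p - lr * g for p, g in zip(b, gb)],
            [p - lr * g for p, g in zip(v, gv)],
            c - lr * gc)

def train(seed, xs, width=4, steps=1000, lr=0.01):
    # Random initialization (hypothetical: centered Gaussians with std 0.5, zero output bias).
    rng = random.Random(seed)
    theta = ([rng.gauss(0.0, 0.5) for _ in range(width)],
             [rng.gauss(0.0, 0.5) for _ in range(width)],
             [rng.gauss(0.0, 0.5) for _ in range(width)],
             0.0)
    initial_risk = risk(theta, xs)
    for _ in range(steps):
        theta = gd_step(theta, generalized_gradient(theta, xs), lr)
    return initial_risk, risk(theta, xs)

if __name__ == "__main__":
    grid = [i / 20 for i in range(21)]
    # Several independent random initializations; keep the smallest final risk.
    runs = [train(seed, grid) for seed in range(3)]
    print("final risks:", [final for _, final in runs])
    print("best final risk:", min(final for _, final in runs))
```

Running several independent initializations and keeping the best final risk mirrors the role of random initializations in the result above; the sketch makes no claim about how many initializations or steps are needed.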
