The training of artificial neural networks (ANNs) with rectified linear unit (ReLU) activation via gradient descent (GD) type optimization schemes is nowadays a common industrially relevant procedure. To this day, however, the scientific literature contains in general no mathematical convergence analysis which explains the numerical success of GD type optimization schemes in the training of ANNs with ReLU activation. GD type optimization schemes can be regarded as temporal discretization methods for the gradient flow (GF) differential equations associated to the considered optimization problem and, in view of this, it seems a natural direction of research to first develop a mathematical convergence theory for time-continuous GF differential equations and, thereafter, to extend such a time-continuous convergence theory to implementable time-discrete GD type optimization methods. In this article we establish two basic results for GF differential equations in the training of fully-connected feedforward ANNs with one hidden layer and ReLU activation. In the first main result of this article we establish, in the training of such ANNs, under the assumption that the probability distribution of the input data of the considered supervised learning problem is absolutely continuous with a bounded density function, that every GF differential equation admits for every initial value a solution which is also unique among a suitable class of solutions. In the second main result of this article we prove, in the training of such ANNs, under the assumption that the target function and the density function of the probability distribution of the input data are piecewise polynomial, that every non-divergent GF trajectory converges with an appropriate rate of convergence to a critical point and that the risk of the non-divergent GF trajectory converges with rate 1 to the risk of the critical point.
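Since the abstract displays no formulas, the following is a minimal sketch of the basic objects under the standard conventions of this line of research; the symbols below (in particular the parametrization $\theta$, the learning rate $\gamma$, the target function $f$, and the input distribution $\mu$) are our notation and are not fixed by the abstract. For a fully-connected feedforward ANN with $d$-dimensional input, one hidden layer of $h$ neurons, and ReLU activation $x \mapsto \max\{x,0\}$, the risk of the considered supervised learning problem reads
\[
  \mathcal{L}(\theta)
  = \int_{\mathbb{R}^d} \Bigl( c + \sum_{k=1}^{h} v_k \max\{ \langle w_k, x \rangle + b_k, 0 \} - f(x) \Bigr)^{2} \, \mu(\mathrm{d}x),
\]
the associated GF differential equation is
\[
  \tfrac{\mathrm{d}}{\mathrm{d}t} \Theta_t = - \mathcal{G}(\Theta_t),
\]
where $\mathcal{G}$ denotes a suitable generalized gradient of $\mathcal{L}$ (the ReLU is not differentiable, so the plain gradient need not exist everywhere), and plain-vanilla GD arises as the explicit Euler discretization of this GF differential equation,
\[
  \theta_{n+1} = \theta_n - \gamma \, \mathcal{G}(\theta_n), \qquad \gamma \in (0,\infty).
\]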