We give a simple proof of the global convergence of gradient descent when training deep ReLU networks with the standard square loss, and show some of its improvements over the state-of-the-art. In particular, while prior works require all the hidden layers to be wide, with width at least $\Omega(N^8)$ ($N$ being the number of training samples), we require a single wide layer whose width is linear, quadratic, or cubic in $N$, depending on the type of initialization. Unlike many recent proofs based on the Neural Tangent Kernel (NTK), our proof does not need to track the evolution of the entire NTK matrix, or more generally, of any quantities related to the changes of activation patterns during training. Instead, we only need to track the evolution of the output at the last hidden layer, which can be done much more easily thanks to the Lipschitz property of ReLU. Some highlights of our setting: (i) all the layers are trained with standard gradient descent, (ii) the network has the standard parameterization as opposed to the NTK one, and (iii) the network has a single wide layer, as opposed to having all wide hidden layers as in most NTK-related results.
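To make the setting concrete, the following is a minimal illustrative sketch (not the authors' code) of the configuration described above: a deep ReLU network with a single wide hidden layer, standard parameterization, and all layers trained jointly by plain full-batch gradient descent on the square loss. The specific widths, depth, sample size, learning rate, and the LeCun-type initialization scale are assumptions chosen for illustration only.

```python
# Sketch of the training setting: deep ReLU net, one wide layer, square loss,
# standard gradient descent on all layers. Hyperparameters are illustrative.
import jax
import jax.numpy as jnp

key = jax.random.PRNGKey(0)
N, d = 64, 10                      # number of training samples, input dimension
widths = [d, 256, 32, 32, 1]       # a single wide hidden layer (256), then narrow layers

def init_params(key, widths):
    # Standard parameterization with 1/sqrt(fan_in) scaling (assumed for illustration).
    params = []
    for m, n in zip(widths[:-1], widths[1:]):
        key, sub = jax.random.split(key)
        params.append(jax.random.normal(sub, (m, n)) / jnp.sqrt(m))
    return params

def forward(params, X):
    h = X
    for W in params[:-1]:
        h = jax.nn.relu(h @ W)     # ReLU hidden layers
    return h @ params[-1]          # linear output layer

def square_loss(params, X, y):
    return 0.5 * jnp.sum((forward(params, X) - y) ** 2)

# Synthetic data, only to make the sketch runnable.
key, kx, ky = jax.random.split(key, 3)
X = jax.random.normal(kx, (N, d))
y = jax.random.normal(ky, (N, 1))

params = init_params(key, widths)
lr = 1e-3
loss_and_grad = jax.jit(jax.value_and_grad(square_loss))
for step in range(2000):
    loss, grads = loss_and_grad(params, X, y)
    # Plain gradient descent update applied to every layer.
    params = [W - lr * g for W, g in zip(params, grads)]
```

Note that, in contrast to NTK-style analyses, nothing here rescales the output by the width or freezes layers: all weight matrices are updated by the same vanilla gradient-descent rule.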