This paper studies the global convergence of gradient descent for deep ReLU networks under the square loss. For this setting, the current state-of-the-art results show that gradient descent converges to a global optimum if the widths of all the hidden layers scale as $\Omega(N^8)$ ($N$ being the number of training samples). We discuss a simple proof framework that allows us to improve the existing over-parameterization condition to linear, quadratic, and cubic widths (depending on the type of initialization scheme and/or the depth of the network).
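To make the setting concrete, the following is a minimal, self-contained sketch of the training setup described above: full-batch gradient descent on a deep ReLU network under the square loss, with Gaussian-initialized weights. This is only an illustration of the setup, not the paper's proof or method; the depth, widths, step size, sample count, and initialization scale are arbitrary choices and do not reflect the paper's over-parameterization conditions.

```python
# Sketch of the setting: full-batch gradient descent on a deep ReLU network
# under the square loss. All sizes and hyperparameters are illustrative only.
import numpy as np

rng = np.random.default_rng(0)

N, d_in, d_out = 32, 10, 1              # N training samples (illustrative)
widths = [d_in, 256, 256, 256, d_out]   # hidden widths chosen arbitrarily wide

# Random training data and targets.
X = rng.standard_normal((N, d_in))
Y = rng.standard_normal((N, d_out))

# Gaussian (He-style) initialization; the paper's width requirements depend on
# the initialization scheme, but this particular scale is just an assumption.
W = [rng.standard_normal((widths[l], widths[l + 1])) * np.sqrt(2.0 / widths[l])
     for l in range(len(widths) - 1)]

def forward(X, W):
    """Forward pass; ReLU on all layers except the last. Returns activations."""
    acts = [X]
    h = X
    for l, Wl in enumerate(W):
        z = h @ Wl
        h = z if l == len(W) - 1 else np.maximum(z, 0.0)
        acts.append(h)
    return acts

lr = 1e-3
for step in range(2001):
    acts = forward(X, W)
    resid = acts[-1] - Y                 # gradient of 0.5 * ||out - Y||^2 w.r.t. out
    loss = 0.5 * np.sum(resid ** 2)

    # Backward pass: propagate the residual through the layers.
    grad_z = resid                       # last layer is linear
    grads = [None] * len(W)
    for l in reversed(range(len(W))):
        grads[l] = acts[l].T @ grad_z
        if l > 0:
            grad_z = (grad_z @ W[l].T) * (acts[l] > 0)   # ReLU derivative

    # Plain (full-batch) gradient descent update on all layers.
    for l in range(len(W)):
        W[l] -= lr * grads[l]

    if step % 500 == 0:
        print(f"step {step:4d}  square loss {loss:.4e}")
```

In this over-parameterized regime (hidden widths far larger than the number of samples), the printed square loss is typically driven close to zero, which is the kind of global convergence behavior the paper analyzes.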