We study the overparametrization bounds required for the global convergence of the stochastic gradient descent algorithm for a class of one-hidden-layer feed-forward neural networks, covering most of the activation functions used in practice, including ReLU. We improve the existing state-of-the-art results in terms of the required hidden-layer width. We introduce a new proof technique that combines nonlinear analysis with properties of random initializations of the network. First, we establish the global convergence of continuous solutions of a differential inclusion that is a nonsmooth analogue of the gradient flow for the MSE loss. Second, we provide a technical result (which also applies to general approximators) relating solutions of the aforementioned differential inclusion to the (discrete) stochastic gradient descent sequences, hence establishing linear convergence towards zero loss for the stochastic gradient descent iterations.
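As a schematic illustration of the setting (the notation below is introduced here for convenience and need not match the paper's own definitions or scalings), one may picture a one-hidden-layer network $f_W(x)=\sum_{k=1}^{m} a_k\,\sigma(w_k^{\top}x)$ with a possibly nonsmooth activation $\sigma$ such as ReLU, trained on data $(x_i,y_i)_{i=1}^{n}$ by minimizing the MSE loss. The two objects compared in the abstract can then be written, schematically, as
\begin{align*}
  L(W) &= \tfrac{1}{2}\sum_{i=1}^{n}\bigl(f_W(x_i)-y_i\bigr)^2
  && \text{(MSE loss)}\\
  \dot W(t) &\in -\partial L\bigl(W(t)\bigr)
  && \text{(differential inclusion: nonsmooth analogue of gradient flow)}\\
  W_{k+1} &= W_k - \eta\, v_k, \qquad v_k \in \partial L_{i_k}(W_k)
  && \text{(stochastic subgradient step on a sampled example $i_k$)}
\end{align*}
where $\partial$ denotes a suitable subdifferential for nonsmooth losses (e.g., the Clarke subdifferential). In this schematic picture, linear convergence towards zero loss means a geometric bound of the form $L(W_k)\le(1-c\eta)^{k}L(W_0)$ for some $c>0$, holding with high probability over the random initialization once the hidden-layer width $m$ exceeds the overparametrization bound.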