Deep neural networks have been successfully trained with stochastic gradient descent in various application areas. However, there is no rigorous mathematical explanation of why this works so well. The training of neural networks with stochastic gradient descent involves four different discretization parameters: (i) the network architecture; (ii) the amount of training data; (iii) the number of gradient steps; and (iv) the number of randomly initialized gradient trajectories. While it can be shown that the approximation error converges to zero if all four parameters are sent to infinity in the right order, we demonstrate in this paper that stochastic gradient descent fails to converge for ReLU networks if their depth is much larger than their width and the number of random initializations does not increase to infinity fast enough.
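To make the four discretization parameters concrete, the following minimal NumPy sketch (our own illustration, not the construction or proof technique from the paper) trains a deep, narrow fully connected ReLU network with plain SGD on a toy one-dimensional regression problem and repeats the training over several independent random initializations. The variables `depth`, `width`, `n`, `steps`, and `restarts` are hypothetical choices standing in for the architecture, the amount of training data, the number of gradient steps, and the number of random initializations.

```python
# Minimal sketch (illustration only): the four discretization parameters of
# SGD training of a ReLU network, with depth much larger than width and a
# fixed number of independently initialized gradient trajectories.
import numpy as np

rng = np.random.default_rng(0)

def init_params(depth, width, d_in=1, d_out=1):
    """He-style random initialization of a fully connected ReLU network."""
    dims = [d_in] + [width] * (depth - 1) + [d_out]
    return [(rng.normal(0.0, np.sqrt(2.0 / m), size=(m, k)), np.zeros(k))
            for m, k in zip(dims[:-1], dims[1:])]

def forward(params, x):
    """Forward pass; returns all layer activations (needed for backprop)."""
    acts = [x]
    for i, (W, b) in enumerate(params):
        z = acts[-1] @ W + b
        acts.append(z if i == len(params) - 1 else np.maximum(z, 0.0))
    return acts

def sgd_step(params, x, y, lr):
    """One stochastic gradient step on the mean squared loss over the batch."""
    acts = forward(params, x)
    grad = 2.0 * (acts[-1] - y) / len(x)        # dLoss / d(network output)
    new_params = []
    for i in reversed(range(len(params))):
        W, b = params[i]
        gW = acts[i].T @ grad
        gb = grad.sum(axis=0)
        # ReLU derivative; the mask is unused (and harmless) once i == 0.
        grad = (grad @ W.T) * (acts[i] > 0)
        new_params.append((W - lr * gW, b - lr * gb))
    return new_params[::-1]

# (ii) training data: a toy one-dimensional regression problem
n = 256
x_train = rng.uniform(-1.0, 1.0, size=(n, 1))
y_train = np.sin(np.pi * x_train)

# (i) architecture with depth >> width, (iii) gradient steps, (iv) random restarts
depth, width, steps, restarts = 20, 3, 2000, 10

best_loss = np.inf
for k in range(restarts):
    params = init_params(depth, width)
    for _ in range(steps):
        idx = rng.choice(n, size=32)
        params = sgd_step(params, x_train[idx], y_train[idx], lr=1e-2)
    loss = np.mean((forward(params, x_train)[-1] - y_train) ** 2)
    best_loss = min(best_loss, loss)
    print(f"restart {k}: final training loss {loss:.4f}")
print(f"best loss over {restarts} restarts: {best_loss:.4f}")
```

With these hypothetical settings, many of the individual restarts tend to stagnate at a large loss (for instance because the deep, narrow network starts in a nearly inactive configuration), so the quality of the reported result depends strongly on how many random initializations are tried, which is the parameter the non-convergence statement above is about.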