We study the problem of training deep neural networks with the Rectified Linear Unit (ReLU) activation function using gradient descent and stochastic gradient descent. In particular, we study the binary classification problem and show that for a broad family of loss functions, with proper random weight initialization, both gradient descent and stochastic gradient descent can find the global minima of the training loss for an over-parameterized deep ReLU network, under mild assumptions on the training data. The key idea of our proof is that Gaussian random initialization followed by (stochastic) gradient descent produces a sequence of iterates that stay inside a small perturbation region centered around the initial weights, in which the empirical loss function of deep ReLU networks enjoys nice local curvature properties that ensure the global convergence of (stochastic) gradient descent. Our theoretical results shed light on understanding the optimization of deep learning, and pave the way to study the optimization dynamics of training modern deep neural networks.
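As a purely illustrative sketch (not the paper's experimental setup or proof), the following PyTorch snippet mirrors the setting described above: a deep ReLU network with Gaussian random initialization is trained by SGD on the cross-entropy loss for a synthetic binary classification task, and the distance of the iterate from its initialization is reported. All sizes (n, d, m, L), the learning rate, and the number of steps are hypothetical placeholders; the theory requires the width m to be sufficiently large relative to the sample size and depth.

```python
import torch
import torch.nn as nn

# Hypothetical sizes for illustration only.
n, d, m, L = 100, 10, 1024, 5
torch.manual_seed(0)

# Synthetic binary classification data with labels in {+1, -1}.
X = torch.randn(n, d)
y = torch.sign(torch.randn(n))

# Deep ReLU network; each weight matrix is drawn from a zero-mean Gaussian
# (He-style scaling N(0, 2/fan_in)), the kind of random initialization
# discussed in the abstract.
layers = [nn.Linear(d, m, bias=False), nn.ReLU()]
for _ in range(L - 1):
    layers += [nn.Linear(m, m, bias=False), nn.ReLU()]
layers += [nn.Linear(m, 1, bias=False)]
net = nn.Sequential(*layers)
for p in net.parameters():
    nn.init.normal_(p, mean=0.0, std=(2.0 / p.shape[1]) ** 0.5)

# Keep a copy of the initial weights to measure how far (S)GD moves.
init_params = [p.detach().clone() for p in net.parameters()]

opt = torch.optim.SGD(net.parameters(), lr=1e-3)
for step in range(200):
    idx = torch.randint(0, n, (32,))              # minibatch for SGD
    margin = y[idx] * net(X[idx]).squeeze(-1)      # y_i * f(x_i)
    loss = torch.log(1 + torch.exp(-margin)).mean()  # logistic (cross-entropy) loss
    opt.zero_grad()
    loss.backward()
    opt.step()

# Distance of the current iterate from initialization; the analysis argues
# this stays small for sufficiently over-parameterized networks.
dist = sum(((p - p0) ** 2).sum() for p, p0 in zip(net.parameters(), init_params)) ** 0.5
print(f"final loss {loss.item():.4f}, distance from init {dist.item():.4f}")
```

Under this kind of over-parameterization, the reported distance from initialization is expected to remain small throughout training, which is the empirical counterpart of the perturbation-region argument sketched in the abstract.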