Optimization by gradient descent has been one of the main drivers of the "deep learning revolution". Yet, despite some recent progress for extremely wide networks, it remains an open problem to understand why gradient descent often converges to global minima when training deep neural networks. This article presents a new criterion for convergence of gradient descent to a global minimum, which is provably more powerful than the best available criteria from the literature, namely, the Lojasiewicz inequality and its generalizations. This criterion is used to show that gradient descent with proper initialization converges to a global minimum when training any feedforward neural network with smooth and strictly increasing activation functions, provided that the input dimension is greater than or equal to the number of data points.
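For background, the Lojasiewicz-type criteria mentioned above take roughly the following standard form (this is the classical statement, not the paper's new criterion; the symbols f, f^*, c, and theta here are generic placeholders):

\[
\|\nabla f(x)\| \;\ge\; c\,\bigl(f(x) - f^{*}\bigr)^{\theta}, \qquad \theta \in [\tfrac{1}{2}, 1),
\]

where f is the training loss and f^* its global infimum. When such an inequality holds along the optimization trajectory (the Polyak-Lojasiewicz case theta = 1/2 being the most familiar), any point with small gradient has loss close to f^*, and gradient descent with a suitable step size is driven to a global minimum.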