Stochastic gradient descent (SGD) has been deployed to solve highly non-linear and non-convex machine learning problems such as the training of deep neural networks. However, previous works on SGD often rely on restrictive and unrealistic assumptions about the nature of the noise in SGD. In this work, we mathematically construct examples that defy previous understandings of SGD. For example, our constructions show that: (1) SGD may converge to a local maximum; (2) SGD may escape a saddle point arbitrarily slowly; (3) SGD may prefer sharp minima over flat ones; and (4) AMSGrad may converge to a local maximum. We also show the relevance of our results to deep learning by presenting a minimal neural network example. Our results suggest that the noise structure of SGD might be more important than the loss landscape in neural network training, and that future research should focus on deriving the actual noise structure in deep learning.
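To make claim (1) concrete, here is a minimal, self-contained sketch of the kind of mechanism the abstract alludes to; it is our own illustrative toy, not the paper's construction, and the loss, noise model, and constants are assumptions chosen for illustration. Near a local maximum of L(theta) = -theta^2/2, an unbiased stochastic gradient with multiplicative noise that vanishes at theta = 0 can pull SGD into the maximum, because the per-step log-factor E[log|1 + eta(1 + sigma*xi)|] becomes negative once the noise scale sigma is large enough.

```python
import numpy as np

# Illustrative toy (our own sketch, not the paper's construction):
# L(theta) = -theta^2 / 2 has a local maximum at theta = 0, and
# g(theta) = -theta * (1 + sigma * xi), xi ~ N(0, 1), is an unbiased estimate
# of grad L(theta) = -theta whose noise vanishes exactly at the maximum.
# The SGD update is theta <- theta * (1 + eta * (1 + sigma * xi)); when
# E[log|1 + eta * (1 + sigma * xi)|] < 0 (large sigma), |theta_t| -> 0,
# i.e. SGD is attracted to the local maximum although the drift points away.

rng = np.random.default_rng(0)
eta, sigma = 0.01, 30.0          # learning rate and noise scale (illustrative)
n_runs, n_steps = 1000, 2000
theta = np.full(n_runs, 0.1)     # start away from the maximum

for _ in range(n_steps):
    xi = rng.standard_normal(n_runs)
    grad = -theta * (1.0 + sigma * xi)   # unbiased stochastic gradient of L
    theta = theta - eta * grad           # plain SGD step

print("median |theta_T|              :", np.median(np.abs(theta)))
print("fraction with |theta_T| < 1e-6:", np.mean(np.abs(theta) < 1e-6))
# In this toy, most runs collapse numerically onto the local maximum at 0.
```

The design point of the sketch is that the deterministic drift alone would push the iterate away from the maximum; it is the state-dependent (multiplicative) noise structure, not the shape of the loss, that makes the maximum attracting, which is consistent with the abstract's thesis that the noise structure can dominate the loss landscape.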