Stochastic gradient descent (SGD) with momentum is widely used for training modern deep learning architectures. While it is well understood that using momentum can lead to a faster convergence rate in various settings, it has also been observed that momentum yields better generalization. Prior work argues that momentum stabilizes the SGD noise during training and that this leads to better generalization. In this paper, we adopt another perspective and first empirically show that gradient descent with momentum (GD+M) significantly improves generalization compared to gradient descent (GD) in some deep learning problems. From this observation, we formally study how momentum improves generalization. We devise a binary classification setting where a one-hidden-layer (over-parameterized) convolutional neural network trained with GD+M provably generalizes better than the same network trained with GD, when both algorithms are similarly initialized. The key insight in our analysis is that momentum is beneficial in datasets where the examples share some feature but differ in their margin. Contrary to GD, which memorizes the small-margin data, GD+M still learns the feature in these data thanks to its historical gradients. Lastly, we empirically validate our theoretical findings.
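As a minimal sketch (not the paper's experimental setup), the snippet below contrasts the two update rules the abstract refers to: plain GD and GD+M with a heavy-ball momentum term whose velocity accumulates historical gradients. The toy objective, learning rate, and momentum coefficient are illustrative assumptions.

```python
import numpy as np

def gd_step(w, grad, lr=0.1):
    """Plain GD: w <- w - lr * grad(w)."""
    return w - lr * grad(w)

def gdm_step(w, v, grad, lr=0.1, beta=0.9):
    """Heavy-ball GD+M: the velocity v accumulates historical gradients,
    so past updates (e.g. from small-margin examples) keep influencing
    the current step."""
    v = beta * v + grad(w)
    return w - lr * v, v

# Toy quadratic objective f(w) = 0.5 * ||w||^2, used only to exercise the updates.
grad = lambda w: w

w_gd = np.array([1.0, -2.0])
w_gdm, v = np.array([1.0, -2.0]), np.zeros(2)
for _ in range(50):
    w_gd = gd_step(w_gd, grad)
    w_gdm, v = gdm_step(w_gdm, v, grad)
print("GD:  ", w_gd)
print("GD+M:", w_gdm)
```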