Adaptive gradient methods, e.g., \textsc{Adam}, have achieved tremendous success in machine learning. By scaling the learning rate element-wise with a form of second-moment estimate of the gradients, such methods attain rapid training of modern deep neural networks. Nevertheless, compared with stochastic gradient descent (\textsc{SGD}) they are observed to suffer from compromised generalization and tend to get trapped in local minima at an early stage of training. Intriguingly, we discover that substituting the raw gradient in the second raw moment estimate of \textsc{Adam} with its momentumized version resolves this issue. The intuition is that the gradient with momentum carries more accurate directional information, so its second-moment estimate is a more favorable choice for learning-rate scaling than that of the raw gradient. We therefore propose \textsc{AdaMomentum}, a new optimizer that trains fast while generalizing much better. We further develop a theory that accounts for the improvement in generalization and provide convergence guarantees under both convex and nonconvex settings. Extensive experiments on a wide range of tasks and models demonstrate that \textsc{AdaMomentum} consistently exhibits state-of-the-art performance and superior training stability.
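To make the modification concrete, the sketch below contrasts the core update with \textsc{Adam}. The notation ($g_t$ for the stochastic gradient, $m_t$ and $v_t$ for the exponential moving averages with decay rates $\beta_1,\beta_2$, step size $\alpha$, and constant $\epsilon$) and the Adam-style bias correction are assumed here for illustration rather than taken from the paper's exact formulation; only the second line embodies the proposed change:
\begin{align*}
m_t &= \beta_1 m_{t-1} + (1-\beta_1)\, g_t, \\
v_t &= \beta_2 v_{t-1} + (1-\beta_2)\, m_t^{2} \qquad \text{(\textsc{Adam} uses } g_t^{2} \text{ here)}, \\
\hat{m}_t &= \frac{m_t}{1-\beta_1^{t}}, \qquad \hat{v}_t = \frac{v_t}{1-\beta_2^{t}}, \\
\theta_{t} &= \theta_{t-1} - \alpha\, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}.
\end{align*}
In words, \textsc{AdaMomentum} accumulates the squared momentumized gradient $m_t^{2}$ rather than the squared raw gradient $g_t^{2}$ when forming the second-moment estimate that scales the learning rate.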