Despite the popularity of the Adam optimizer in practice, most theoretical analyses study Stochastic Gradient Descent (SGD) as a proxy for Adam, and little is known about how the solutions found by Adam differ from those found by SGD. In this paper, we show that Adam implicitly reduces a distinct sharpness measure shaped by its adaptive updates, leading to qualitatively different solutions from SGD. More specifically, when the training loss is small, Adam wanders around the manifold of minimizers and takes semi-gradient steps to minimize this sharpness measure in an adaptive manner, a behavior we rigorously characterize through a continuous-time approximation based on stochastic differential equations. We further demonstrate how this behavior differs from that of SGD in a well-studied setting: when training overparameterized models with label noise, SGD has been shown to minimize the trace of the Hessian matrix, $\tr(\mH)$, whereas we prove that Adam minimizes $\tr(\Diag(\mH)^{1/2})$ instead. When solving sparse linear regression with diagonal linear networks, this distinction enables Adam to achieve better sparsity and generalization than SGD. Finally, our analysis framework extends beyond Adam to a broad class of adaptive gradient methods, including RMSProp, Adam-mini, Adalayer, and Shampoo, and provides a unified perspective on how these adaptive optimizers reduce sharpness, which we hope will offer insights for future optimizer design.
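For concreteness, the two sharpness measures can be written entry-wise; here we assume $\Diag(\mH)$ denotes the diagonal matrix formed from the diagonal entries $H_{ii}$ of the Hessian (which are nonnegative at a minimizer):
\[
\tr(\mH) \;=\; \sum_{i} H_{ii},
\qquad
\tr\bigl(\Diag(\mH)^{1/2}\bigr) \;=\; \sum_{i} \sqrt{H_{ii}},
\]
so the penalty implicitly minimized by Adam applies a square root to each diagonal curvature before summing, whereas the penalty associated with SGD sums the diagonal curvatures directly.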