Ever since Reddi et al. (2018) pointed out the divergence issue of Adam, many new variants have been designed to obtain convergence. However, vanilla Adam remains exceptionally popular and works well in practice. Why is there a gap between theory and practice? We point out that there is a mismatch between the settings of theory and practice: Reddi et al. (2018) pick the problem after picking the hyperparameters of Adam, i.e., $(\beta_1, \beta_2)$, whereas practical applications often fix the problem first and then tune $(\beta_1, \beta_2)$. Based on this observation, we conjecture that the empirical convergence can be theoretically justified only if we reverse the order of picking the problem and the hyperparameters. In this work, we confirm this conjecture. We prove that, when $\beta_2$ is large and $\beta_1 < \sqrt{\beta_2}<1$, Adam converges to a neighborhood of critical points. The size of the neighborhood is proportional to the variance of the stochastic gradients. Under an extra condition (the strong growth condition), Adam converges to critical points. It is worth mentioning that our results cover a wide range of hyperparameters: as $\beta_2$ increases, our convergence result can cover any $\beta_1 \in [0,1)$, including $\beta_1=0.9$, which is the default setting in deep learning libraries. To our knowledge, this is the first result showing that Adam can converge without any modification of its update rules. Furthermore, our analysis does not require assumptions of bounded gradients or bounded second-order momentum. When $\beta_2$ is small, we further point out a large region of $(\beta_1,\beta_2)$ where Adam can diverge to infinity. Our divergence result considers the same setting as our convergence result, indicating a phase transition from divergence to convergence as $\beta_2$ increases. These positive and negative results provide suggestions on how to tune Adam's hyperparameters.
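For reference, the sketch below writes out the vanilla Adam update that the above results concern, with the condition $\beta_1 < \sqrt{\beta_2} < 1$ checked explicitly for the default setting $(\beta_1, \beta_2) = (0.9, 0.999)$. The quadratic objective and the names `grad`, `lr`, and `eps` are illustrative assumptions, not part of the paper; this is a minimal sketch, not the experimental setup.

```python
# Minimal sketch of vanilla Adam (Kingma & Ba, 2015) on an illustrative
# quadratic objective f(x) = 0.5 * ||x||^2; names here are assumptions.
import numpy as np

def adam(grad, x0, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, steps=1000):
    # The hyperparameter regime studied here: beta1 < sqrt(beta2) < 1.
    # With the defaults, sqrt(0.999) ~= 0.9995 > 0.9, so the check passes.
    assert beta1 < np.sqrt(beta2) < 1.0, "outside the convergence regime"

    x = np.asarray(x0, dtype=float)
    m = np.zeros_like(x)  # 1st-order momentum (EMA of gradients)
    v = np.zeros_like(x)  # 2nd-order momentum (EMA of squared gradients)
    for t in range(1, steps + 1):
        g = grad(x)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        m_hat = m / (1 - beta1 ** t)  # bias correction
        v_hat = v / (1 - beta2 ** t)
        x = x - lr * m_hat / (np.sqrt(v_hat) + eps)
    return x

# Example: minimize f(x) = 0.5 * ||x||^2, whose gradient is x.
print(adam(lambda x: x, x0=[1.0, -2.0]))
```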