Ever since Reddi et al. (2018) pointed out the divergence issue of Adam, many new variants have been designed to obtain convergence. However, vanilla Adam remains exceptionally popular and works well in practice. Why is there a gap between theory and practice? We point out a mismatch between the settings of theory and practice: Reddi et al. (2018) pick the problem after picking the hyperparameters of Adam, i.e., $(\beta_1, \beta_2)$, whereas practical applications often fix the problem first and then tune $(\beta_1, \beta_2)$. Based on this observation, we conjecture that the empirical convergence can be theoretically justified only if we change the order of picking the problem and the hyperparameters. In this work, we confirm this conjecture. We prove that, when $\beta_2$ is large and $\beta_1 < \sqrt{\beta_2}<1$, Adam converges to a neighborhood of critical points. The size of the neighborhood is proportional to the variance of the stochastic gradients. Under an extra condition (the strong growth condition), Adam converges to critical points. As $\beta_2$ increases, our convergence result can cover any $\beta_1 \in [0,1)$, including $\beta_1=0.9$, the default setting in deep learning libraries. To our knowledge, this is the first result showing that Adam can converge under a wide range of hyperparameters {\it without any modification} of its update rules. Further, our analysis does not require assumptions of bounded gradients or bounded second-order momentum. When $\beta_2$ is small, we further identify a large region of $(\beta_1,\beta_2)$ in which Adam can diverge to infinity. Our divergence result considers the same setting as our convergence result, indicating a phase transition from divergence to convergence as $\beta_2$ increases. Together, these positive and negative results provide suggestions on how to tune Adam's hyperparameters.
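For reference, the following is a standard statement of the vanilla Adam update, showing where the hyperparameters $(\beta_1, \beta_2)$ enter; bias-correction terms are omitted, and the exact treatment of the numerical constant $\epsilon$ may differ from the version analyzed in this work. Here $g_t$ denotes the stochastic gradient at step $t$ and $\eta_t$ the learning rate:
\begin{align*}
m_t &= \beta_1 m_{t-1} + (1-\beta_1)\, g_t, && \text{(first-order momentum, controlled by $\beta_1$)}\\
v_t &= \beta_2 v_{t-1} + (1-\beta_2)\, g_t^{2}, && \text{(second-order momentum, controlled by $\beta_2$)}\\
x_{t+1} &= x_t - \eta_t\, \frac{m_t}{\sqrt{v_t} + \epsilon}. && \text{(parameter update)}
\end{align*}
The convergence and divergence regimes discussed above are stated directly in terms of $\beta_1$ and $\beta_2$ in these recursions.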