Since its invention in 2014, the Adam optimizer has received tremendous attention. On one hand, it has been widely used in deep learning and many variants have been proposed; on the other hand, its theoretical convergence properties remain a mystery. The state of affairs is far from satisfactory: some studies require strong assumptions about the updates, which are not necessarily applicable in practice, while other studies still follow the original, problematic convergence analysis of Adam, which was shown to be insufficient to ensure convergence. Although rigorous convergence analyses exist for Adam, they impose specific requirements on the update of the adaptive step size that are not generic enough to cover many other variants of Adam. To address these issues, in this extended abstract we present a simple and generic proof of convergence for a family of Adam-style methods (including Adam, AMSGrad, AdaBound, etc.). Our analysis only requires an increasing or large "momentum" parameter for the first-order moment, which is indeed the setting used in practice, and a boundedness condition on the adaptive factor of the step size, which applies to all variants of Adam under mild conditions on the stochastic gradients. We also establish a variance-diminishing result for the stochastic gradient estimators used. Indeed, our analysis of Adam is simple and generic enough that it can be leveraged to establish convergence for a broader family of non-convex optimization problems, including min-max, compositional, and bilevel optimization problems. For the full (earlier) version of this extended abstract, please refer to arXiv:2104.14840.
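To make the family of Adam-style methods concrete, the sketch below implements the shared update rule with different realizations of the adaptive step-size factor. This is a minimal illustration, not the paper's analysis: the function name, the default hyperparameters, and the fixed AdaBound-style clipping band are assumptions made here for the example (the actual AdaBound schedule tightens its band over time).

```python
import numpy as np

def adam_family(grad_fn, x0, steps=500, lr=0.1, beta1=0.9, beta2=0.999,
                eps=1e-8, variant="adam"):
    """Generic Adam-style loop: x <- x - eta_t * m_t, where eta_t is the
    adaptive factor of the step size.  The variants below differ only in
    how eta_t is formed; the boundedness condition in the abstract concerns
    keeping eta_t within fixed bounds."""
    x = np.asarray(x0, dtype=float).copy()
    m = np.zeros_like(x)       # first-order moment ("momentum")
    v = np.zeros_like(x)       # second-order moment
    v_max = np.zeros_like(x)   # running max of v, used by AMSGrad
    for _ in range(steps):
        g = grad_fn(x)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        if variant == "amsgrad":
            v_max = np.maximum(v_max, v)        # enforces non-increasing eta_t
            eta = lr / (np.sqrt(v_max) + eps)
        else:
            eta = lr / (np.sqrt(v) + eps)
        if variant == "adabound":
            # Clip the step into a band; these constants are illustrative.
            eta = np.clip(eta, 0.01, 1.0)
        x = x - eta * m
    return x
```

For instance, running any of the three variants on the quadratic f(x) = (x - 3)^2 with gradient 2(x - 3) drives the iterate toward 3, while each variant produces a differently shaped adaptive factor along the way.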