We provide a simple proof of convergence covering both the Adam and Adagrad adaptive optimization algorithms when applied to smooth (possibly non-convex) objective functions with bounded gradients. We show that in expectation, the squared norm of the objective gradient averaged over the trajectory has an upper bound which is explicit in the constants of the problem, the parameters of the optimizer, and the total number of iterations $N$. This bound can be made arbitrarily small: Adam with a learning rate $\alpha=1/\sqrt{N}$ and a momentum parameter on squared gradients $\beta_2=1-1/N$ achieves the same rate of convergence $O(\ln(N)/\sqrt{N})$ as Adagrad. Finally, we obtain the tightest dependency on the heavy-ball momentum among all previous convergence bounds for non-convex Adam and Adagrad, improving from $O((1-\beta_1)^{-3})$ to $O((1-\beta_1)^{-1})$. Our technique also improves the best known dependency for standard SGD by a factor $1 - \beta_1$.
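The hyper-parameter schedule highlighted above ($\alpha=1/\sqrt{N}$, $\beta_2=1-1/N$ for a horizon of $N$ iterations) can be written down directly. Below is a minimal sketch, assuming PyTorch's `torch.optim.Adam` and a placeholder linear model; both are illustrative choices and not part of the paper, and the Adam variant analyzed in the proof may differ in details such as bias correction.

```python
import math
import torch

# Illustrative setup only: a placeholder model and a chosen iteration budget N.
N = 10_000                       # total number of iterations (assumed horizon)
beta1 = 0.9                      # heavy-ball momentum parameter (example value)
model = torch.nn.Linear(10, 1)   # placeholder model

# Adam with the schedule from the abstract:
#   learning rate alpha = 1 / sqrt(N), squared-gradient momentum beta_2 = 1 - 1/N.
optimizer = torch.optim.Adam(
    model.parameters(),
    lr=1.0 / math.sqrt(N),          # alpha = 1/sqrt(N)
    betas=(beta1, 1.0 - 1.0 / N),   # beta_2 = 1 - 1/N
)
```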