We provide a simple proof of convergence covering both the Adam and Adagrad adaptive optimization algorithms when applied to smooth (possibly non-convex) objective functions with bounded gradients. We show that, in expectation, the squared norm of the objective gradient averaged over the trajectory has an upper bound that is explicit in the constants of the problem, the parameters of the optimizer, the dimension $d$, and the total number of iterations $N$. This bound can be made arbitrarily small, and with the right hyper-parameters Adam can be shown to converge at the same rate as Adagrad, $O(d\ln(N)/\sqrt{N})$. However, when used with its default parameters, Adam does not converge and, just like constant step-size SGD, it moves away from the initialization point faster than Adagrad, which might explain its practical success. Finally, we obtain the tightest dependency on the heavy-ball momentum decay rate $\beta_1$ among all previous convergence bounds for non-convex Adam and Adagrad, improving it from $O((1-\beta_1)^{-3})$ to $O((1-\beta_1)^{-1})$.
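The quantity bounded above is the squared gradient norm averaged over the optimizer's trajectory. The sketch below, which is not from the paper, shows one way to track that quantity empirically for Adam and Adagrad on a smooth, non-convex toy objective; the objective, dimension, iteration count, and hyper-parameter choices (including the annealed Adam schedule with a learning rate scaled by $1/\sqrt{N}$ and $\beta_2 = 1 - 1/N$) are illustrative assumptions, not the paper's experimental setup.

```python
# Minimal sketch (assumptions, not the paper's setup): track the squared
# gradient norm averaged over the trajectory for Adam and Adagrad.
import torch


def objective(x):
    # Smooth, non-convex objective with bounded gradient (sum of sines).
    return torch.sin(x).sum()


def avg_sq_grad_norm(opt_name, d=100, n_iters=1000, **opt_kwargs):
    x = torch.zeros(d, requires_grad=True)
    opt_cls = {"adam": torch.optim.Adam, "adagrad": torch.optim.Adagrad}[opt_name]
    opt = opt_cls([x], **opt_kwargs)
    total = 0.0
    for _ in range(n_iters):
        opt.zero_grad()
        loss = objective(x)
        loss.backward()
        total += x.grad.pow(2).sum().item()  # ||∇F(x_n)||^2 at this iterate
        opt.step()
    return total / n_iters  # average over the trajectory


if __name__ == "__main__":
    N = 1000
    # Default Adam vs. an "annealed" Adam (lr ∝ 1/sqrt(N), beta2 = 1 - 1/N),
    # a choice consistent with the O(d ln(N)/sqrt(N)) rate discussed above;
    # the exact schedule here is an assumption for illustration only.
    print("Adam (default): ", avg_sq_grad_norm("adam", n_iters=N))
    print("Adam (annealed):", avg_sq_grad_norm(
        "adam", n_iters=N, lr=1e-3 / N ** 0.5, betas=(0.9, 1 - 1 / N)))
    print("Adagrad:        ", avg_sq_grad_norm("adagrad", n_iters=N, lr=0.1))
```

The toy run uses a deterministic objective, so the expectation in the bound is trivial; it is only meant to make concrete which quantity the bound controls, not to reproduce any result.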