One of the most popular training algorithms for deep neural networks is Adaptive Moment Estimation (Adam), introduced by Kingma and Ba. Despite its success in many applications, there is no satisfactory convergence analysis: only local convergence can be shown in batch mode under some restrictions on the hyperparameters, and counterexamples exist for incremental mode. Recent results show that for simple quadratic objective functions limit cycles of period 2 exist in batch mode, but only for atypical hyperparameters and only for the algorithm without bias correction. We extend the convergence analysis for Adam in batch mode with bias correction and show that even for quadratic objective functions, the simplest case of convex functions, 2-limit-cycles exist for all choices of the hyperparameters. We analyze the stability of these limit cycles and relate our analysis to other results where approximate convergence was shown, but under the additional assumption of bounded gradients, which does not apply to quadratic functions. The investigation relies heavily on computer algebra due to the complexity of the equations.
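For reference, the batch-mode Adam update with bias correction referred to above can be sketched as follows; the notation ($\alpha$, $\beta_1$, $\beta_2$, $\varepsilon$, iterate $w_t$) follows the standard formulation of Kingma and Ba and is an assumption here, not taken verbatim from this paper:
\[
\begin{aligned}
g_t &= \nabla f(w_{t-1}),\\
m_t &= \beta_1\, m_{t-1} + (1-\beta_1)\, g_t,\\
v_t &= \beta_2\, v_{t-1} + (1-\beta_2)\, g_t^{\,2},\\
\hat m_t &= \frac{m_t}{1-\beta_1^{\,t}}, \qquad
\hat v_t = \frac{v_t}{1-\beta_2^{\,t}},\\
w_t &= w_{t-1} - \alpha\, \frac{\hat m_t}{\sqrt{\hat v_t} + \varepsilon}.
\end{aligned}
\]
For a quadratic objective such as $f(w) = \tfrac{c}{2} w^2$ (an illustrative choice), the gradient is simply $g_t = c\, w_{t-1}$; a 2-limit-cycle is then a state of this iteration that repeats after exactly two steps instead of converging to the minimizer.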