Adam is one of the most influential adaptive stochastic algorithms for training deep neural networks, yet it has been shown to diverge even in simple convex settings via a few counterexamples. Many remedies have been proposed to make Adam-type algorithms converge, such as decreasing the adaptive learning rate, adopting a large batch size, incorporating a temporal decorrelation technique, seeking an analogous surrogate, \textit{etc}. In contrast with these approaches, we introduce an alternative, easy-to-check sufficient condition, which depends only on the base learning rate and the combination of historical second-order moments, to guarantee the global convergence of generic Adam for large-scale non-convex stochastic optimization. This sufficient condition also provides a deeper interpretation of the divergence of Adam. On the other hand, mini-batch Adam and distributed Adam are widely used in practice without theoretical guarantees; we further analyze how the batch size and the number of nodes in a distributed system affect the convergence of Adam, showing theoretically that mini-batch and distributed Adam can be linearly accelerated by using a larger mini-batch size or more nodes. Finally, we apply generic Adam and mini-batch Adam satisfying the sufficient condition to the counterexample and to training several neural networks on various real-world datasets. The experimental results are in exact accord with our theoretical analysis.
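For concreteness, the generic Adam family discussed above follows the standard update built from exponential moving averages of the gradient and its square. The sketch below shows the usual parameterization (step size `alpha`, moment decays `beta1`, `beta2`); it is a minimal illustration of the algorithm template, not the paper's exact variant or its sufficient condition.

```python
import numpy as np

def generic_adam_step(theta, grad, m, v, t,
                      alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One step of a standard Adam-type update (illustrative sketch only;
    parameter names alpha/beta1/beta2/eps are the conventional ones,
    not taken from this paper)."""
    m = beta1 * m + (1 - beta1) * grad        # first moment: EMA of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2   # second moment: EMA of squared gradients
    m_hat = m / (1 - beta1 ** t)              # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)              # bias-corrected second moment
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Usage: minimize f(x) = x^2 from x = 1.0
theta, m, v = 1.0, 0.0, 0.0
for t in range(1, 1001):
    grad = 2.0 * theta                        # gradient of x^2
    theta, m, v = generic_adam_step(theta, grad, m, v, t, alpha=0.01)
```

The divergence counterexamples referenced above arise because the effective step `alpha * m_hat / sqrt(v_hat)` can fail to shrink appropriately when the second-moment average `v` forgets large past gradients too quickly.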