Adam is one of the most influential adaptive stochastic algorithms for training deep neural networks, yet it has been shown to diverge, even in simple convex settings, via a few counterexamples. Many attempts have been made to make Adam-type algorithms converge, such as decreasing the adaptive learning rate, adopting a large batch size, incorporating a temporal decorrelation technique, seeking an analogous surrogate, \textit{etc.} In contrast with existing approaches, we introduce an alternative easy-to-check sufficient condition, which depends only on the parameters of the base learning rate and the combinations of historical second-order moments, to guarantee the global convergence of generic Adam for solving large-scale non-convex stochastic optimization. This observation, coupled with the sufficient condition, gives a much deeper interpretation of the divergence of Adam. On the other hand, in practice, mini-batch Adam and distributed Adam are widely used without any theoretical guarantee. We further analyze how the mini-batch size and the number of nodes in a distributed system affect the convergence of Adam, showing theoretically that mini-batch and distributed Adam can be linearly accelerated by using a larger mini-batch size or a larger number of nodes. Finally, we apply generic Adam and mini-batch Adam under the sufficient condition to the counterexample and to training several neural networks on various real-world datasets. The experimental results are in exact accord with our theoretical analysis.
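For concreteness, the following is a minimal sketch of the Adam-type recursion referenced above, written in assumed notation ($g_t$ for the stochastic gradient, $\beta_{1,t}$ and $\beta_{2,t}$ for the moment parameters, $\alpha_t$ for the base learning rate, and $\epsilon$ a small constant); the paper's generic Adam may weight the historical second-order moments more generally than this standard exponential moving average.
\begin{align*}
m_t &= \beta_{1,t}\, m_{t-1} + (1-\beta_{1,t})\, g_t, \\
v_t &= \beta_{2,t}\, v_{t-1} + (1-\beta_{2,t})\, g_t^2, \\
x_{t+1} &= x_t - \frac{\alpha_t}{\sqrt{v_t} + \epsilon}\, m_t.
\end{align*}
The sufficient condition discussed above is stated in terms of quantities of this form, namely the schedule of the base learning rate $\alpha_t$ and the way the second-order moments $g_1^2,\dots,g_t^2$ are combined into $v_t$.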