Convergence and convergence rate analyses of adaptive methods, such as Adaptive Moment Estimation (Adam) and its variants, have been widely studied for nonconvex optimization. These analyses rest on the assumptions that the expected or empirical average loss function is Lipschitz smooth (i.e., its gradient is Lipschitz continuous) and that the learning rates depend on this Lipschitz constant. Meanwhile, numerical evaluations of Adam and its variants have shown that small constant learning rates chosen without reference to the Lipschitz constant, together with hyperparameters ($\beta_1$ and $\beta_2$) close to one, are advantageous for training deep neural networks. Since computing the Lipschitz constant is NP-hard, the Lipschitz smoothness condition is unrealistic in practice. This paper provides theoretical analyses of Adam that do not assume the Lipschitz smoothness condition, in order to bridge the gap between theory and practice. The main contribution is theoretical evidence that Adam with small learning rates and hyperparameters close to one performs well, whereas previous theoretical results all assumed hyperparameters close to zero. Our analysis also leads to the finding that Adam performs well with large batch sizes. Moreover, we show that Adam performs well when it uses diminishing learning rates and hyperparameters close to one.
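For reference, a minimal sketch of the standard Adam update in the notation used above (following Kingma and Ba's formulation; the precise variant analyzed in this paper may differ in details such as bias correction), with learning rate $\alpha$, hyperparameters $\beta_1, \beta_2 \in [0,1)$, stochastic gradient $g_t$, and element-wise products, square roots, and divisions:
\begin{align*}
  m_t &= \beta_1 m_{t-1} + (1-\beta_1)\, g_t, &
  v_t &= \beta_2 v_{t-1} + (1-\beta_2)\, g_t \odot g_t,\\
  \hat{m}_t &= \frac{m_t}{1-\beta_1^t}, &
  \hat{v}_t &= \frac{v_t}{1-\beta_2^t},\\
  \theta_{t+1} &= \theta_t - \alpha\, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}. & &
\end{align*}
In this notation, the practical regime discussed above corresponds to a small constant $\alpha$ chosen independently of any Lipschitz constant and $\beta_1, \beta_2$ close to one (e.g., the common defaults $\beta_1 = 0.9$, $\beta_2 = 0.999$), which is the setting the analysis supports.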