Adaptive gradient methods such as Adam have gained increasing popularity in deep learning optimization. However, it has been observed that, compared with (stochastic) gradient descent, Adam can converge to a different solution with a significantly worse test error in many deep learning applications such as image classification, even with fine-tuned regularization. In this paper, we provide a theoretical explanation for this phenomenon: we show that in the nonconvex setting of learning over-parameterized two-layer convolutional neural networks starting from the same random initialization, for a class of data distributions (inspired by image data), Adam and gradient descent (GD) can converge to different global solutions of the training objective with provably different generalization errors, even with weight decay regularization. In contrast, we show that if the training objective is convex and weight decay regularization is employed, any optimization algorithm, including Adam and GD, will converge to the same solution if training is successful. This suggests that the inferior generalization performance of Adam is fundamentally tied to the nonconvex landscape of deep learning optimization.
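To make the comparison setting concrete, below is a minimal sketch (not the paper's construction or proof setup): two copies of an over-parameterized two-layer CNN are trained from the same random initialization, one with full-batch GD and one with Adam, both with weight decay, and their test errors are compared. The synthetic data, network width, and hyperparameters are illustrative assumptions only.

```python
# Hedged sketch: compare GD vs. Adam (both with weight decay) on a toy
# two-layer CNN trained from a shared random initialization. This is an
# illustration of the experimental protocol, not the paper's construction.
import copy
import torch
import torch.nn as nn

torch.manual_seed(0)

class TwoLayerCNN(nn.Module):
    """Two-layer CNN: one convolutional layer + average pooling + linear head."""
    def __init__(self, width=128):
        super().__init__()
        self.conv = nn.Conv1d(1, width, kernel_size=4, stride=4, bias=False)
        self.fc = nn.Linear(width, 1, bias=False)

    def forward(self, x):
        h = torch.relu(self.conv(x))   # (batch, width, num_patches)
        h = h.mean(dim=-1)             # pool over patches
        return self.fc(h).squeeze(-1)

def make_data(n=200, d=16):
    # Toy "signal + noise" data: the label depends on one planted coordinate.
    x = torch.randn(n, 1, d)
    y = torch.sign(x[:, 0, 0]).float()
    x[:, 0, 0] += 2.0 * y
    return x, y

x_train, y_train = make_data()
x_test, y_test = make_data()

def train_and_eval(model, optimizer, steps=500):
    loss_fn = nn.SoftMarginLoss()      # logistic loss on +/-1 labels
    for _ in range(steps):             # full-batch updates
        optimizer.zero_grad()
        loss_fn(model(x_train), y_train).backward()
        optimizer.step()
    with torch.no_grad():
        pred = torch.sign(model(x_test))
        return (pred != y_test).float().mean().item()

base = TwoLayerCNN()
gd_model, adam_model = copy.deepcopy(base), copy.deepcopy(base)

gd_err = train_and_eval(
    gd_model, torch.optim.SGD(gd_model.parameters(), lr=0.05, weight_decay=1e-3))
adam_err = train_and_eval(
    adam_model, torch.optim.Adam(adam_model.parameters(), lr=1e-3, weight_decay=1e-3))

print(f"GD test error:   {gd_err:.3f}")
print(f"Adam test error: {adam_err:.3f}")
```

Both runs share the same initialization and regularization strength, so any gap in test error on this toy task reflects only the choice of optimizer; the paper's theoretical results characterize when and why such a gap arises in the nonconvex regime.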