It is not yet clear why ADAM-like adaptive gradient algorithms suffer from worse generalization performance than SGD despite their faster training speed. This work aims to provide an understanding of this generalization gap by analyzing the local convergence behaviors of these algorithms. Specifically, we observe that the gradient noise in these algorithms is heavy-tailed. This motivates us to analyze these algorithms through their Levy-driven stochastic differential equations (SDEs), since an algorithm and its SDE share similar convergence behaviors. We then establish the escaping time of these SDEs from a local basin. The result shows that (1) the escaping time of both SGD and ADAM depends positively on the Radon measure of the basin and negatively on the heaviness of the gradient noise; (2) for the same basin, SGD enjoys a smaller escaping time than ADAM, mainly because (a) the geometry adaptation in ADAM, which adaptively scales each gradient coordinate, diminishes the anisotropic structure in the gradient noise and results in a larger Radon measure of a basin; (b) the exponential gradient average in ADAM smooths its gradient and leads to lighter gradient noise tails than in SGD. Hence SGD is more locally unstable than ADAM at sharp minima, defined as minima whose local basins have small Radon measure, and can better escape from them to flatter minima with larger Radon measure. As flat minima here, which often refer to minima lying in flat or asymmetric basins/valleys, tend to generalize better than sharp ones, our result explains the better generalization performance of SGD over ADAM. Finally, experimental results confirm our heavy-tailed gradient noise assumption and support our theoretical findings.
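For concreteness, the generic form of the Levy-driven SDEs referred to above is sketched below; this is a minimal illustration rather than the exact dynamics analyzed for SGD and ADAM in the paper, whose drift and scaling terms may differ.

% Minimal sketch, assuming a loss F(\theta), a small noise scale \epsilon > 0,
% and an \alpha-stable Levy process L^{\alpha}_t with tail index \alpha \in (0,2);
% smaller \alpha corresponds to heavier-tailed gradient noise.
\begin{equation*}
  \mathrm{d}\theta_t \;=\; -\nabla F(\theta_t)\,\mathrm{d}t \;+\; \epsilon\,\mathrm{d}L^{\alpha}_t .
\end{equation*}
% Schematically (not the paper's exact theorem), the expected escaping time
% \Gamma from a local basin \mathcal{W} behaves like
% \mathbb{E}[\Gamma] \propto \epsilon^{-\alpha}\,\mu(\mathcal{W}):
% it grows with the Radon-type measure \mu(\mathcal{W}) of the basin and
% shrinks as the noise tails get heavier (smaller \alpha), matching the
% qualitative dependence stated above.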