Adaptive Moment Estimation (Adam), which combines Adaptive Learning Rate and Momentum, is the most popular stochastic optimizer for accelerating the training of deep neural networks. However, Adam often generalizes significantly worse than Stochastic Gradient Descent (SGD). It remains mathematically unclear how Adaptive Learning Rate and Momentum affect saddle-point escaping and minima selection. Based on the diffusion theoretical framework, we decouple the effects of Adaptive Learning Rate and Momentum on saddle-point escaping and minima selection. We prove that Adaptive Learning Rate can escape saddle points efficiently, but cannot select flat minima as SGD does. In contrast, Momentum provides a drift effect that helps pass through saddle points, and it barely affects flat minima selection. This mathematically explains why SGD (with Momentum) generalizes better, while Adam generalizes worse but converges faster. We design a novel adaptive optimizer named Adaptive Inertia Estimation (Adai), which uses parameter-wise adaptive inertia to accelerate training and provably favors flat minima as much as SGD. Our real-world experiments demonstrate that Adai can significantly outperform SGD and existing Adam variants.
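To make "parameter-wise adaptive inertia" concrete, the following is a minimal NumPy sketch of an Adai-style update, assuming an illustrative rule in which the momentum coefficient is set per parameter from the normalized second moment of the gradients; the function name, hyperparameters, and bias correction are assumptions for illustration, not the paper's official algorithm.

```python
# A minimal sketch of a parameter-wise adaptive-inertia update in the spirit of Adai.
# The exact rule, clipping, and hyperparameter names below are illustrative assumptions.
import numpy as np

def adai_like_step(theta, grad, m, v, t, lr=1e-3, beta0=0.1, beta2=0.99, eps=1e-3):
    """One update step: the inertia (momentum) coefficient beta1 is chosen
    per parameter from the bias-corrected second moment of the gradients."""
    # Exponential moving average of squared gradients (per parameter).
    v = beta2 * v + (1.0 - beta2) * grad**2
    v_hat = v / (1.0 - beta2**t)                 # bias-corrected second moment
    # Parameter-wise inertia: lighter inertia where gradient noise is large,
    # heavier inertia where it is small (clipped to stay in [0, 1 - eps]).
    beta1 = np.clip(1.0 - beta0 * v_hat / v_hat.mean(), 0.0, 1.0 - eps)
    m = beta1 * m + (1.0 - beta1) * grad         # momentum with adaptive inertia
    theta = theta - lr * m                       # SGD-style step, no adaptive learning rate
    return theta, m, v

# Usage: minimize f(x) = 0.5 * ||x||^2, whose gradient is x itself.
theta = np.random.randn(5)
m, v = np.zeros_like(theta), np.zeros_like(theta)
for t in range(1, 201):
    theta, m, v = adai_like_step(theta, theta, m, v, t, lr=0.1)
print(theta)  # should be close to the minimum at zero
```

Note the design contrast with Adam suggested by the abstract: the learning rate itself stays uniform (as in SGD), and only the inertia coefficient adapts per parameter.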