Adaptive gradient algorithms borrow the moving-average idea of heavy-ball acceleration to estimate accurate first- and second-order moments of the gradient for accelerating convergence. However, Nesterov acceleration, which converges faster than heavy-ball acceleration both in theory and in many empirical cases, is much less investigated under the adaptive gradient setting. In this work, we propose the ADAptive Nesterov momentum algorithm, Adan for short, to speed up the training of deep neural networks effectively. Adan first reformulates the vanilla Nesterov acceleration to develop a new Nesterov momentum estimation (NME) method, which avoids the extra computation and memory overhead of computing the gradient at the extrapolation point. Then Adan adopts NME to estimate the first- and second-order moments of the gradient in adaptive gradient algorithms for convergence acceleration. Besides, we prove that Adan finds an $\epsilon$-approximate first-order stationary point within $O(\epsilon^{-3.5})$ stochastic gradient complexity on nonconvex stochastic problems (e.g., deep learning problems), matching the best-known lower bound. Extensive experimental results show that Adan surpasses the corresponding SoTA optimizers on both vision transformers (ViTs) and CNNs, and sets new SoTAs for many popular networks, e.g., ResNet, ConvNext, ViT, Swin, MAE, LSTM, Transformer-XL, and BERT. More surprisingly, Adan can use half of the training cost (epochs) of SoTA optimizers to achieve higher or comparable performance on ViT, ResNet, etc., and also shows great tolerance to a large range of minibatch sizes, e.g., from 1k to 32k. We hope Adan can contribute to the development of deep learning by reducing training cost and relieving the engineering burden of trying different optimizers on various architectures. Code is released at https://github.com/sail-sg/Adan.
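To make the NME idea concrete, below is a minimal, self-contained sketch of one Adan-style update step in NumPy. It only illustrates the structure the abstract describes: a first-order moment of the gradient, a moment of the gradient difference (the NME term, which replaces evaluating the gradient at an extrapolation point), and a second-order moment of the NME-corrected gradient. The coefficient conventions, default hyperparameters, and the decoupled weight-decay step here are illustrative assumptions; the repository linked above is the reference implementation.

```python
import numpy as np

def adan_step(theta, g, g_prev, m, v, n,
              lr=1e-3, beta1=0.02, beta2=0.08, beta3=0.01,
              eps=1e-8, weight_decay=0.0):
    """One illustrative Adan-style update (a sketch, not the official code).

    theta:  parameters
    g:      current stochastic gradient; g_prev: previous gradient
    m:      first-order moment of the gradient
    v:      moment of the gradient difference (NME term)
    n:      second-order moment of the NME-corrected gradient
    """
    diff = g - g_prev
    m = (1 - beta1) * m + beta1 * g       # first-order moment
    v = (1 - beta2) * v + beta2 * diff    # gradient-difference moment
    u = g + (1 - beta2) * diff            # NME-corrected gradient
    n = (1 - beta3) * n + beta3 * u * u   # second-order moment
    eta = lr / (np.sqrt(n) + eps)         # coordinate-wise step size
    theta = (theta - eta * (m + (1 - beta2) * v)) / (1 + lr * weight_decay)
    return theta, m, v, n
```

Note the key difference from heavy-ball-style estimators such as Adam's: the gradient difference `g - g_prev` injects the Nesterov "look-ahead" correction into both moments using only already-computed gradients, so no extra forward/backward pass at an extrapolated point is needed.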