Adaptive gradient methods such as RMSProp and Adam use an exponential moving average of the squared gradient to compute coordinate-wise adaptive step sizes, achieving better convergence than SGD in the face of noisy objectives. However, Adam can exhibit undesirable convergence behavior due to unstable or extreme adaptive learning rates. Methods such as AMSGrad and AdaBound have been proposed to stabilize the adaptive learning rates of Adam in the later stage of training, but they do not outperform Adam in some practical tasks such as training Transformers. In this paper, we propose an adaptive learning rate principle in which the running mean of the squared gradient is replaced by a weighted mean, with weights chosen to maximize the estimated variance of each coordinate. This gives a worst-case estimate of the local gradient variance, so the method takes smaller steps when large curvature or noisy gradients are present, leading to more desirable convergence behavior than Adam. We prove that the proposed algorithm converges under mild assumptions for nonconvex stochastic optimization problems, and demonstrate the improved efficacy of our adaptive averaging approach on image classification, machine translation, and natural language understanding tasks. Moreover, our method overcomes the non-convergence issue of Adam in BERT pretraining at large batch sizes, while achieving better test performance than LAMB in the same setting. The code is available at https://github.com/zhuchen03/MaxVA.
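To make the stated principle concrete, the following is a minimal, hypothetical sketch of the idea described above: for each coordinate, an averaging weight is selected to maximize the estimated gradient variance, and an Adam-style step is then taken with the resulting estimates. The function name `maxva_like_step`, the candidate weight grid `betas`, and the discrete search are illustrative assumptions only; the actual algorithm and its choice of weights are specified in the paper and the released code at the URL above.

```python
import numpy as np

def maxva_like_step(theta, grad, m, v, lr=1e-3, eps=1e-8,
                    betas=(0.5, 0.9, 0.99, 0.999)):
    """Illustrative update: per coordinate, pick the averaging weight that
    maximizes the estimated gradient variance, then take an Adam-style step.

    NOTE: `betas` is a hypothetical candidate set used for a discrete search;
    this sketch does not reproduce the paper's actual weight selection.
    """
    best_var = np.full_like(theta, -np.inf)
    new_m, new_v = m.copy(), v.copy()
    for beta in betas:
        cand_m = beta * m + (1 - beta) * grad        # weighted mean of gradients
        cand_v = beta * v + (1 - beta) * grad ** 2   # weighted mean of squared gradients
        cand_var = cand_v - cand_m ** 2              # per-coordinate variance estimate
        mask = cand_var > best_var                   # keep the variance-maximizing weight
        best_var = np.where(mask, cand_var, best_var)
        new_m = np.where(mask, cand_m, new_m)
        new_v = np.where(mask, cand_v, new_v)
    # A larger worst-case variance estimate enlarges the denominator,
    # yielding smaller steps under high curvature or gradient noise.
    theta = theta - lr * new_m / (np.sqrt(new_v) + eps)
    return theta, new_m, new_v
```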