The learning rate warmup heuristic achieves remarkable success in stabilizing training, accelerating convergence, and improving generalization for adaptive stochastic optimization algorithms like RMSprop and Adam. Here, we study its mechanism in detail. Pursuing the theory behind warmup, we identify a problem of the adaptive learning rate (i.e., it has problematically large variance in the early stage), suggest that warmup works as a variance reduction technique, and provide both empirical and theoretical evidence to verify our hypothesis. We further propose RAdam, a new variant of Adam, by introducing a term to rectify the variance of the adaptive learning rate. Extensive experimental results on image classification, language modeling, and neural machine translation verify our intuition and demonstrate the effectiveness and robustness of our proposed method. All implementations are available at: https://github.com/LiyuanLucasLiu/RAdam.
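To make the "term to rectify the variance" concrete, the sketch below computes a rectification factor in the style of the RAdam formulation, where beta2 is the second-moment decay rate and rho_inf, rho_t, and r_t denote the (approximate) degrees of freedom and the resulting rescaling of the adaptive step; the exact threshold and update rule here follow our reading of the standard formulation rather than the linked reference implementation, so treat it as an illustrative sketch.

import math

def radam_rectification(t, beta2):
    """Variance rectification term for an RAdam-style update.

    t: current optimization step (1-indexed).
    beta2: exponential decay rate of the second-moment estimate.
    Returns (r_t, use_adaptive): r_t rescales the adaptive step, and
    use_adaptive is False while the variance of the adaptive learning
    rate is still intractably large (early steps).
    """
    rho_inf = 2.0 / (1.0 - beta2) - 1.0
    rho_t = rho_inf - 2.0 * t * beta2 ** t / (1.0 - beta2 ** t)
    if rho_t > 4.0:  # variance of the adaptive learning rate is tractable
        r_t = math.sqrt(
            ((rho_t - 4.0) * (rho_t - 2.0) * rho_inf)
            / ((rho_inf - 4.0) * (rho_inf - 2.0) * rho_t)
        )
        return r_t, True
    # Early stage: fall back to an un-adapted (SGD-with-momentum) step.
    return 1.0, False

In this sketch, early steps skip the adaptive denominator entirely (mimicking warmup's variance-reduction effect), and later steps scale the usual Adam update by r_t, which approaches 1 as training progresses.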