The learning rate warmup heuristic achieves remarkable success in stabilizing training, accelerating convergence, and improving generalization for adaptive stochastic optimization algorithms like RMSprop and Adam. Here, we study its mechanism in detail. Pursuing the theory behind warmup, we identify a problem of the adaptive learning rate (i.e., it has problematically large variance in the early stage), suggest that warmup works as a variance reduction technique, and provide both empirical and theoretical evidence to verify our hypothesis. We further propose RAdam, a new variant of Adam, by introducing a term to rectify the variance of the adaptive learning rate. Extensive experimental results on image classification, language modeling, and neural machine translation verify our intuition and demonstrate the effectiveness and robustness of our proposed method. All implementations are available at: https://github.com/LiyuanLucasLiu/RAdam.
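To make the "term to rectify the variance" concrete, the sketch below computes a rectification factor in the style of the RAdam formulation, where beta2 is the second-moment decay rate and rho_inf, rho_t, and r_t denote the (approximate) degrees of freedom and the resulting rescaling of the adaptive step; the exact threshold and update rule here follow our reading of the standard formulation rather than the linked reference implementation, so treat it as an illustrative sketch.

import math

def radam_rectification(t, beta2):
    """Variance rectification term for an RAdam-style update.

    t: current optimization step (1-indexed).
    beta2: exponential decay rate of the second-moment estimate.
    Returns (r_t, use_adaptive): r_t rescales the adaptive step, and
    use_adaptive is False while the variance of the adaptive learning
    rate is still intractably large (early steps).
    """
    rho_inf = 2.0 / (1.0 - beta2) - 1.0
    rho_t = rho_inf - 2.0 * t * beta2 ** t / (1.0 - beta2 ** t)
    if rho_t > 4.0:  # variance of the adaptive learning rate is tractable
        r_t = math.sqrt(
            ((rho_t - 4.0) * (rho_t - 2.0) * rho_inf)
            / ((rho_inf - 4.0) * (rho_inf - 2.0) * rho_t)
        )
        return r_t, True
    # Early stage: fall back to an un-adapted (SGD-with-momentum) step.
    return 1.0, False

In this sketch, early steps skip the adaptive denominator entirely (mimicking warmup's variance-reduction effect), and later steps scale the usual Adam update by r_t, which approaches 1 as training progresses.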