Adaptive optimization algorithms such as Adam are widely used in deep learning. The stability of such algorithms is often improved with a warmup schedule for the learning rate. Motivated by the difficulty of choosing and tuning warmup schedules, recent work proposes automatic variance rectification of Adam's adaptive learning rate, claiming that this rectified approach ("RAdam") surpasses the vanilla Adam algorithm and reduces the need for expensive tuning of Adam with warmup. In this work, we refute this analysis and provide an alternative explanation for the necessity of warmup based on the magnitude of the update term, which is of greater relevance to training stability. We then provide some "rule-of-thumb" warmup schedules, and we demonstrate that simple untuned warmup of Adam performs more-or-less identically to RAdam in typical practical settings. We conclude by suggesting that practitioners stick to linear warmup with Adam, with a sensible default being linear warmup over $2 / (1 - \beta_2)$ training iterations.
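The suggested default of linear warmup over $2 / (1 - \beta_2)$ iterations can be sketched as a simple learning-rate multiplier. This is an illustrative helper, not code from the paper; the function name and signature are our own.

```python
def untuned_linear_warmup(step, base_lr, beta2=0.999):
    """Linear warmup multiplier for Adam's learning rate.

    Ramps the learning rate linearly from ~0 to base_lr over
    2 / (1 - beta2) steps, then holds it constant. With the common
    default beta2 = 0.999, this gives roughly 2000 warmup steps.
    """
    warmup_steps = 2.0 / (1.0 - beta2)
    scale = min(1.0, (step + 1) / warmup_steps)
    return base_lr * scale
```

In practice this multiplier would be applied to the optimizer's learning rate at every iteration, e.g. via a scheduler hook, before the Adam update is taken.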