Adaptive optimization algorithms such as Adam (Kingma & Ba, 2014) are widely used in deep learning. The stability of such algorithms is often improved with a warmup schedule for the learning rate. Motivated by the difficulty of choosing and tuning warmup schedules, Liu et al. (2019) propose automatic variance rectification of Adam's adaptive learning rate, claiming that this rectified approach ("RAdam") surpasses the vanilla Adam algorithm and reduces the need for expensive tuning of Adam with warmup. In this work, we point out various shortcomings of this analysis. We then provide an alternative explanation for the necessity of warmup based on the magnitude of the update term, which is of greater relevance to training stability. Finally, we provide some "rule-of-thumb" warmup schedules, and we demonstrate that simple untuned warmup of Adam performs more-or-less identically to RAdam in typical practical settings. We conclude by suggesting that practitioners stick to linear warmup with Adam, with a sensible default being linear warmup over $2 / (1 - \beta_2)$ training iterations.
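The suggested default can be sketched as a simple learning-rate multiplier. This is a minimal illustration, not the paper's exact implementation; the function name is ours, and we assume the multiplier ramps linearly to 1 over $2 / (1 - \beta_2)$ steps and stays at 1 thereafter.

```python
def untuned_linear_warmup(step, beta2=0.999):
    """Linear warmup multiplier for Adam's learning rate.

    Ramps from near 0 to 1 over 2 / (1 - beta2) training steps
    (the rule-of-thumb default above), then holds at 1.
    `step` is the zero-indexed training iteration.
    """
    warmup_steps = 2.0 / (1.0 - beta2)  # e.g. 2000 steps for beta2 = 0.999
    return min(1.0, (step + 1) / warmup_steps)
```

In a typical training loop, the base learning rate would be multiplied by this factor each iteration (e.g. via a `LambdaLR`-style scheduler in PyTorch); with the common setting $\beta_2 = 0.999$, the warmup lasts 2000 iterations.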