Adaptive optimization algorithms such as Adam (Kingma & Ba, 2014) are widely used in deep learning. The stability of such algorithms is often improved with a warmup schedule for the learning rate. Motivated by the difficulty of choosing and tuning warmup schedules, Liu et al. (2020) propose automatic variance rectification of Adam's adaptive learning rate, claiming that this rectified approach ("RAdam") surpasses the vanilla Adam algorithm and reduces the need for expensive tuning of Adam with warmup. In this work, we refute this analysis and provide an alternative explanation for the necessity of warmup based on the magnitude of the update term, which is of greater relevance to training stability. We then provide some "rule-of-thumb" warmup schedules, and we demonstrate that a simple, untuned warmup of Adam performs more or less identically to RAdam in typical practical settings. We conclude by suggesting that practitioners stick to linear warmup with Adam, with a sensible default being linear warmup over $2 / (1 - \beta_2)$ training iterations.
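To make the suggested default concrete: a linear warmup over $2 / (1 - \beta_2)$ iterations corresponds to a learning-rate multiplier of $\min\!\left(1, \tfrac{(1 - \beta_2)\, t}{2}\right)$ at step $t$. The following is a minimal sketch of this schedule in PyTorch, assuming Adam's default $\beta_2 = 0.999$ (a 2,000-step ramp); the helper name `untuned_linear_warmup` and the wiring through `torch.optim.lr_scheduler.LambdaLR` are illustrative choices, not prescribed here.

```python
# A minimal sketch of untuned linear warmup for Adam, assuming PyTorch.
# The helper name `untuned_linear_warmup` is hypothetical.
import torch


def untuned_linear_warmup(step: int, beta2: float = 0.999) -> float:
    """Learning-rate multiplier that ramps linearly from ~0 to 1 over
    2 / (1 - beta2) steps (2,000 steps for the default beta2 = 0.999)."""
    warmup_steps = 2.0 / (1.0 - beta2)
    return min(1.0, float(step + 1) / warmup_steps)


model = torch.nn.Linear(10, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999))
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda step: untuned_linear_warmup(step, beta2=0.999)
)

for step in range(10):
    # loss.backward() would normally precede optimizer.step()
    optimizer.step()
    scheduler.step()  # rescales lr by the warmup multiplier for the next step
```

The design point is that the warmup horizon is tied to $1 / (1 - \beta_2)$, the effective averaging timescale of Adam's second-moment estimate, rather than being tuned per task.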