LLMs are commonly trained with a learning rate (LR) warmup, followed by cosine decay to 10% of the maximum (10x decay). In a large-scale empirical study, we show that under an optimal peak LR, a simple linear decay-to-zero (D2Z) schedule consistently outperforms other schedules when training at compute-optimal dataset sizes. D2Z is superior across a range of model sizes, batch sizes, datasets, and vocabularies, and its benefits grow with dataset size. Leveraging a novel interpretation of AdamW as an exponential moving average of weight updates, we show how linear D2Z optimally balances the demands of early training (moving away from initial conditions) and late training (averaging over more updates to mitigate gradient noise). In experiments, a 610M-parameter model trained for 80 tokens-per-parameter (TPP) with D2Z achieves lower loss than the same model trained for 200 TPP with 10x decay, corresponding to an astonishing 60% compute savings. Models such as Llama2-7B, trained for 286 TPP with 10x decay, could likely have saved a majority of compute by training with D2Z.
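For concreteness, here is a minimal Python sketch of the two schedules being compared: warmup followed by cosine decay to 10% of the peak LR (the common baseline) versus warmup followed by linear decay to zero (D2Z). The function names, linear warmup shape, and step counts are illustrative assumptions, not the paper's exact configuration.

```python
import math

def lr_cosine_10x(step: int, total_steps: int, warmup_steps: int, peak_lr: float) -> float:
    """Warmup, then cosine decay to 10% of the peak LR (the '10x decay' baseline)."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    min_lr = 0.1 * peak_lr
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

def lr_linear_d2z(step: int, total_steps: int, warmup_steps: int, peak_lr: float) -> float:
    """Warmup, then linear decay all the way to zero (D2Z)."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * (1.0 - progress)

if __name__ == "__main__":
    # Illustrative values only; print the two schedules at a few checkpoints.
    total, warmup, peak = 10_000, 500, 3e-4
    for s in (0, warmup, total // 2, total - 1):
        print(f"step {s:>6}: cosine-to-10% {lr_cosine_10x(s, total, warmup, peak):.2e}  "
              f"linear D2Z {lr_linear_d2z(s, total, warmup, peak):.2e}")
```

The 60% figure follows from training the same model on 80 rather than 200 tokens per parameter: at fixed model size, compute scales with tokens, and 1 - 80/200 = 0.6.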
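The EMA interpretation of AdamW can be made concrete by unrolling the decoupled weight-decay update; the notation below (LR $\eta_t$, weight decay $\lambda$, Adam step direction $u_t$) is a sketch of the standard unrolling, not necessarily the paper's exact formulation.

```latex
% AdamW step with decoupled weight decay, unrolled over T steps
w_t = (1 - \eta_t \lambda)\, w_{t-1} - \eta_t u_t
\;\;\Longrightarrow\;\;
w_T = \Big(\prod_{t=1}^{T} (1 - \eta_t \lambda)\Big) w_0
      \;-\; \sum_{t=1}^{T} \eta_t u_t \prod_{s=t+1}^{T} (1 - \eta_s \lambda).
```

In this view, the schedule $\{\eta_t\}$ controls both how quickly the initialization $w_0$ is forgotten (early training) and how many recent updates carry non-negligible weight in the average (late training), which is the trade-off the abstract says linear D2Z balances.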