Recent works have demonstrated great success in pre-training large-scale autoregressive language models on massive GPU clusters. To reduce the wall-clock training time, a common practice is to increase the batch size and learning rate. However, such practice is often brittle and leads to a so-called stability-efficiency dilemma: increasing the batch size and learning rate improves training efficiency but can also result in training instability, leading to poor generalization accuracy or failed runs. To better understand this phenomenon, we conduct an in-depth analysis of large-scale pre-training experiments replicating the GPT-2 model. We find a strong correlation between training instability and extreme values of gradient variance, and that samples with long sequence lengths contribute to these extreme gradient variance values, especially at the beginning of training, indicating that long sequence length can be a main source of training instability. Based on this analysis, we present a Sequence Length Warmup method that aims to solve the training stability-efficiency dilemma. Experiments replicating GPT-2 models show that our approach enables stable training with an 8x larger batch size and a 4x larger learning rate, whereas the baseline approach struggles with training instability. To achieve the same or better zero-shot evaluation results, our method reduces the required number of training tokens and wall-clock time by up to 2.2x and 3.7x, respectively. Experiments replicating a GPT-3 model (125M) show that our approach enables stable training with an 8x larger batch size and a 40x larger learning rate, and retains 99% of the zero-shot accuracy on 11 tasks using 10x less data and 17x less time compared to the original GPT-3 training recipe, while the baseline diverges under the same settings and retains only 95% of the accuracy under a lower learning rate.
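The core idea of Sequence Length Warmup is to cap the sequence length of training samples early on and gradually raise it to the full context length. A minimal sketch of such a schedule (the function name, the linear ramp, and all default values here are illustrative assumptions, not the paper's exact recipe):

```python
def seq_len_warmup(step, warmup_steps, start_len=64, max_len=1024, multiple=8):
    """Illustrative linear warmup of the training sequence length.

    Starts at start_len tokens and ramps linearly to max_len over
    warmup_steps optimizer steps; afterwards the full length is used.
    Lengths are rounded down to a multiple of `multiple` for hardware
    efficiency. All parameters are assumed defaults for this sketch.
    """
    if step >= warmup_steps:
        return max_len
    frac = step / warmup_steps
    length = int(start_len + frac * (max_len - start_len))
    # Round to a multiple, but never below the starting length.
    return max(start_len, (length // multiple) * multiple)
```

At each training step, batches would then be truncated (or re-packed) to `seq_len_warmup(step, ...)` tokens, so that early updates are computed only on short sequences, which the paper identifies as the stable regime.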