Recent works have demonstrated great success in training high-capacity autoregressive language models (GPT, GPT-2, GPT-3) on huge amounts of unlabeled text corpora for text generation. Despite showing great results, this raises two training efficiency challenges. First, training on large corpora can be extremely time-consuming, and how to present training samples to the model to improve token-wise convergence speed remains a challenging and open question. Second, many of these large models have to be trained with hundreds or even thousands of processors using data parallelism with a very large batch size. Despite its better compute efficiency, large-batch training has been observed to often run into training instability issues or converge to solutions with poor generalization performance. To overcome these two challenges, we present a study of a curriculum learning based approach, which helps improve the pre-training convergence speed of autoregressive models. More importantly, we find that curriculum learning, as a regularization method, exerts a gradient variance reduction effect and enables training autoregressive models with much larger batch sizes and learning rates without training instability, further improving the training speed. Our evaluations demonstrate that curriculum learning enables training GPT-2 models (with up to 1.5B parameters) with an 8x larger batch size and a 4x larger learning rate, whereas the baseline approach struggles with training divergence. To achieve the same validation perplexity targets during pre-training, curriculum learning reduces the required number of tokens and wall clock time by up to 59% and 54%, respectively. To achieve the same or better zero-shot WikiText-103/LAMBADA evaluation results at the end of pre-training, curriculum learning reduces the required number of tokens and wall clock time by up to 13% and 61%, respectively.
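To make the idea of "presenting training samples from easy to hard" concrete, below is a minimal sketch of one common way to instantiate curriculum learning for autoregressive pre-training: using sequence length as the difficulty metric and growing it with a linear pacing function. The function names, hyperparameters, and the choice of a linear schedule are illustrative assumptions, not the exact schedule evaluated in this work.

```python
# Minimal sketch of a sequence-length curriculum for autoregressive pre-training.
# Assumption: difficulty is measured by sequence length, which is increased
# linearly from `start_seqlen` to `full_seqlen` over `curriculum_steps` updates.
# All names and hyperparameters are illustrative.

def curriculum_seqlen(step: int,
                      start_seqlen: int = 64,
                      full_seqlen: int = 1024,
                      curriculum_steps: int = 10000,
                      multiple_of: int = 8) -> int:
    """Return the maximum sequence length to use at a given training step."""
    if step >= curriculum_steps:
        return full_seqlen
    frac = step / curriculum_steps
    seqlen = int(start_seqlen + frac * (full_seqlen - start_seqlen))
    # Round down to a hardware-friendly multiple and clamp to the valid range.
    seqlen = max(start_seqlen, (seqlen // multiple_of) * multiple_of)
    return min(seqlen, full_seqlen)


def apply_curriculum(token_batch, step: int):
    """Truncate a batch of token ids with shape (batch, full_seqlen) to the
    curriculum length for the current step."""
    seqlen = curriculum_seqlen(step)
    return token_batch[:, :seqlen]
```

In this sketch, early steps train on short (easier) sequences, which tends to reduce gradient variance, and the model only sees full-length sequences after the curriculum phase ends; the same pattern can be combined with a larger batch size and learning rate as studied in the paper.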