We show that autoregressive language models can learn to infill text after we apply a straightforward transformation to the dataset, which simply moves a span of text from the middle of a document to its end. While this data augmentation has garnered much interest in recent years, we provide extensive evidence that training models with a large fraction of data transformed in this way does not harm the original left-to-right generative capability, as measured by perplexity and sampling evaluations across a wide range of scales. Given the usefulness, simplicity, and efficiency of training models to fill-in-the-middle (FIM), we suggest that future autoregressive language models be trained with FIM by default. To this end, we run a series of ablations on key hyperparameters, such as the data transformation frequency, the structure of the transformation, and the method of selecting the infill span. We use these ablations to prescribe strong default settings and best practices to train FIM models. We have released our best infilling model trained with best practices in our API, and release our infilling benchmarks to aid future research.
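Below is a minimal sketch of the data transformation described above, written in Python under stated assumptions: the sentinel strings <PRE>, <SUF>, and <MID> are hypothetical placeholders (the trained models use special tokens whose exact form is not given here), and the middle span is chosen uniformly at random, whereas the paper ablates the span-selection method.

```python
import random

# Hypothetical sentinel strings standing in for the special tokens a real
# FIM-trained model would use; their exact form is an assumption here.
PRE, SUF, MID = "<PRE>", "<SUF>", "<MID>"


def fim_transform(document: str, rng: random.Random) -> str:
    """Move a randomly chosen middle span of `document` to its end,
    so a left-to-right model can learn to infill it."""
    # Pick two cut points that split the document into prefix / middle / suffix.
    i, j = sorted(rng.sample(range(len(document) + 1), 2))
    prefix, middle, suffix = document[:i], document[i:j], document[j:]
    # Present prefix and suffix first; the middle span is moved to the end,
    # where the autoregressive model is trained to generate it.
    return f"{PRE}{prefix}{SUF}{suffix}{MID}{middle}"


if __name__ == "__main__":
    rng = random.Random(0)
    print(fim_transform("def add(a, b):\n    return a + b\n", rng))
```

At training time, only a fraction of documents would be rearranged this way (the data transformation frequency ablated in the paper), with the remainder kept in ordinary left-to-right order.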