We find that existing language modeling datasets contain many near-duplicate examples and long repetitive substrings. As a result, over 1% of the unprompted output of language models trained on these datasets is copied verbatim from the training data. We develop two tools that allow us to deduplicate training datasets -- for example removing from C4 a single 61-word English sentence that is repeated over 60,000 times. Deduplication allows us to train models that emit memorized text ten times less frequently and require fewer train steps to achieve the same or better accuracy. We can also reduce train-test overlap, which affects over 4% of the validation set of standard datasets, thus allowing for more accurate evaluation. We release code for reproducing our work and performing dataset deduplication at https://github.com/google-research/deduplicate-text-datasets.
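The abstract does not spell out how near-duplicate examples are detected; one standard approach for this kind of approximate matching at scale is MinHash over word shingles. The sketch below is purely illustrative and is not the paper's implementation: the function names and parameters (shingles, minhash_signature, num_hashes) are our own, and it estimates Jaccard similarity between two documents from their signatures.

```python
import hashlib
import re


def shingles(text, n=5):
    """Split text into overlapping word n-grams (shingles)."""
    words = re.findall(r"\w+", text.lower())
    return {" ".join(words[i:i + n]) for i in range(max(len(words) - n + 1, 1))}


def minhash_signature(text, num_hashes=128):
    """For each of num_hashes seeded hash functions, keep the minimum
    hash value over all shingles; equal slots across two signatures
    occur with probability equal to the sets' Jaccard similarity."""
    grams = shingles(text)
    return [
        min(int.from_bytes(
            hashlib.blake2b(f"{seed}:{g}".encode(), digest_size=8).digest(),
            "big") for g in grams)
        for seed in range(num_hashes)
    ]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


a = "We develop two tools that allow us to deduplicate training datasets."
b = "We develop two new tools that let us deduplicate training datasets."
print(estimated_jaccard(minhash_signature(a), minhash_signature(b)))
```

In practice, signatures are usually split into bands and hashed into buckets (locality-sensitive hashing) so candidate near-duplicates are found without comparing all document pairs.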