The ever-growing diversity of pre-training text corpora has equipped language models with generalization capabilities across various downstream tasks. However, such diverse datasets are often too large for academic budgets; hence, most research on Transformer architectures, training procedures, optimizers, etc. is conducted on smaller, homogeneous datasets. To this end, we present The MiniPile Challenge, in which one pre-trains a language model on a diverse text corpus containing at most 1M documents. MiniPile is a 6GB subset of the deduplicated 825GB The Pile corpus. To curate MiniPile, we perform a simple, three-step data filtering process: we (1) infer embeddings for all documents of the Pile, (2) cluster the embedding space using $k$-means, and (3) filter out low-quality clusters. To verify MiniPile's suitability for language model pre-training, we use it to pre-train a BERT and a T5 model, yielding a performance drop of only $1.9\%$/$2.5\%$ on the GLUE and SNI benchmarks compared to the original checkpoints, which were pre-trained on $2.6\times$/$745\times$ the amount of data. MiniPile is available at https://huggingface.co/datasets/JeanKaddour/minipile.
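As a rough illustration of the three-step curation pipeline, the sketch below embeds a small slice of the deduplicated Pile, clusters the embeddings with $k$-means, and drops documents belonging to excluded clusters. It is a minimal sketch under stated assumptions: the embedding model, slice size, cluster count, and excluded cluster IDs are illustrative placeholders, not the paper's exact settings.

```python
# Minimal sketch of the three-step filtering pipeline, assuming a
# sentence-transformers embedding model and scikit-learn's k-means.
# Model name, slice size, cluster count, and excluded cluster IDs are
# illustrative placeholders only.
from datasets import load_dataset
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

# (1) Infer embeddings for (a small slice of) the deduplicated Pile.
docs = load_dataset("EleutherAI/the_pile_deduplicated", split="train[:10000]")
encoder = SentenceTransformer("intfloat/e5-large")  # assumed embedding model
embeddings = encoder.encode(
    ["passage: " + d["text"][:2048] for d in docs],  # truncate long documents
    batch_size=64,
    normalize_embeddings=True,
)

# (2) Cluster the embedding space with k-means.
kmeans = KMeans(n_clusters=220, random_state=0).fit(embeddings)

# (3) Filter out low-quality clusters. In practice, cluster quality is judged
# by inspecting documents nearest to each centroid; the IDs here are dummies.
excluded_clusters = {3, 17, 42}
keep_idx = [i for i, label in enumerate(kmeans.labels_)
            if label not in excluded_clusters]
minipile_like = docs.select(keep_idx)
```

In this setup, the per-cluster inspection step (deciding which cluster IDs go into the excluded set) is manual; only the embedding, clustering, and final filtering are automated.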