Recent work has demonstrated that increased training dataset diversity improves general cross-domain knowledge and downstream generalization capability for large-scale language models. With this in mind, we present \textit{the Pile}: an 825 GiB English text corpus targeted at training large-scale language models. The Pile is constructed from 22 diverse high-quality subsets -- both existing and newly constructed -- many of which derive from academic or professional sources. Our evaluation of the untuned performance of GPT-2 and GPT-3 on the Pile shows that these models struggle on many of its components, such as academic writing. Conversely, models trained on the Pile improve significantly over both Raw CC and CC-100 on all components of the Pile, while improving performance on downstream evaluations. Through an in-depth exploratory analysis, we document potentially concerning aspects of the data for prospective users. We make publicly available the code used in its construction.