The Pile: An 800GB Dataset of Diverse Text for Language Modeling

Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, Connor Leahy

Recent work has demonstrated that increased training dataset diversity improves general cross-domain knowledge and downstream generalization capability for large-scale language models. With this in mind, we present the Pile: an 825 GiB English text corpus targeted at training large-scale language models. The Pile is constructed from 22 diverse high-quality subsets, both existing and newly constructed, many of which derive from academic or professional sources. Our evaluation of the untuned performance of GPT-2 and GPT-3 on the Pile shows that these models struggle on many of its components, such as academic writing. Conversely, models trained on the Pile improve significantly over both Raw CC and CC-100 on all components of the Pile, while improving performance on downstream evaluations. Through an in-depth exploratory analysis, we document potentially concerning aspects of the data for prospective users. We make publicly available the code used in its construction.

