We present an ongoing initiative to provide open, very large, high-quality, and richly annotated textual datasets for almost 200 languages. At 30 trillion tokens, this is likely the largest generally available multilingual collection of LLM pre-training data. These datasets are derived from web crawls from different sources and accompanied by a complete, open-source pipeline covering document selection from web archives; text extraction from HTML; language identification for noisy texts; exact and near-deduplication; annotation with, among others, register labels, text quality estimates, and personally identifiable information; and final selection and filtering. We probe data quality through contrastive and analytical statistics, through manual inspection of samples for 24 languages, and through end-to-end evaluation of various language model architectures trained on this data. For multilingual LLM evaluation, we provide a comprehensive collection of benchmarks for nine European languages, with special emphasis on natively created tasks, mechanisms to mitigate prompt sensitivity, and refined normalization and aggregation of scores. Additionally, we train and evaluate a family of 57 monolingual encoder-decoder models, as well as a handful of monolingual GPT-like reference models. Besides the monolingual data and models, we also present a very large collection of parallel texts automatically mined from this data, together with a novel parallel corpus synthesized via machine translation.
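To make the pipeline stages listed above concrete, the sketch below chains them in order for a single stream of crawled documents. It is a minimal, hypothetical illustration only, not the released implementation: the helper callables (extract_text, identify_language, annotate, passes_filters) are assumed placeholders, exact deduplication is reduced to hashing, and near-deduplication (e.g. MinHash) is omitted.

```python
# Minimal, hypothetical sketch of the pipeline stages named in the abstract:
# extraction -> language identification -> exact deduplication -> annotation
# -> final filtering. All helper callables are placeholders, not project code.
import hashlib
from typing import Callable, Iterable, Iterator, Optional

def run_pipeline(
    html_docs: Iterable[str],
    extract_text: Callable[[str], str],
    identify_language: Callable[[str], Optional[str]],
    annotate: Callable[[str, str], dict],
    passes_filters: Callable[[dict], bool],
) -> Iterator[dict]:
    seen: set[str] = set()
    for html in html_docs:
        text = extract_text(html)                 # text extraction from HTML
        lang = identify_language(text)            # language ID for noisy text
        if lang is None:
            continue
        digest = hashlib.sha256(text.strip().encode("utf-8")).hexdigest()
        if digest in seen:                        # exact deduplication
            continue                              # (near-dedup, e.g. MinHash, omitted)
        seen.add(digest)
        record = annotate(text, lang)             # register label, quality score, PII flags
        if passes_filters(record):                # final selection and filtering
            yield record
```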