HPLT 3.0: Very Large-Scale Multilingual Resources for LLM and MT. Mono- and Bi-lingual Data, Multilingual Evaluation, and Pre-Trained Models

Stephan Oepen, Nikolay Arefev, Mikko Aulamo, Marta Bañón, Maja Buljan, Laurie Burchell, Lucas Charpentier, Pinzhen Chen, Mariya Fedorova, Ona de Gibert, Barry Haddow, Jan Hajič, Jindřich Helcl, Andrey Kutuzov, Veronika Laippala, Zihao Li, Risto Luukkonen, Bhavitvya Malik, Vladislav Mikhailov, Amanda Myntti, Dayyán O'Brien, Lucie Poláková, Sampo Pyysalo, Gema Ramírez Sánchez, Janine Siewert, Pavel Stepachev, Jörg Tiedemann, Teemu Vahtola, Dušan Variš, Fedor Vitiugin, Tea Vojtěchová, Jaume Zaragoza

We present an ongoing initiative to provide open, very large, high-quality, and richly annotated textual datasets for almost 200 languages. At 30 trillion tokens, this is likely the largest generally available multilingual collection of LLM pre-training data. The datasets are derived from web crawls from different sources and are accompanied by a complete, open-source pipeline covering document selection from web archives; text extraction from HTML; language identification for noisy texts; exact and near-deduplication; annotation with, among others, register labels, text quality estimates, and personally identifiable information; and final selection and filtering. We report on data quality probes through contrastive and analytical statistics, through manual inspection of samples for 24 languages, and through end-to-end evaluation of various language model architectures trained on this data. For multilingual LLM evaluation, we provide a comprehensive collection of benchmarks for nine European languages, with special emphasis on natively created tasks, mechanisms to mitigate prompt sensitivity, and refined normalization and aggregation of scores. Additionally, we train and evaluate a family of 57 monolingual encoder-decoder models, as well as a handful of monolingual GPT-like reference models. Besides the monolingual data and models, we also present a very large collection of parallel texts automatically mined from this data, together with a novel parallel corpus synthesized via machine translation.
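To make the "exact and near-deduplication" stage of the pipeline concrete, here is a minimal Python sketch. It is an illustration only, not the HPLT implementation: the abstract does not say which algorithms are used, so the SHA-256 hash of normalized text (exact duplicates), the word 3-gram Jaccard similarity (near duplicates), the 0.8 threshold, and all function names are assumptions. The pairwise comparison is also quadratic, so a web-scale system would need an approximate method such as MinHash with locality-sensitive hashing.

```python
import hashlib


def normalize(text):
    """Lowercase and collapse whitespace (an assumed normalization)."""
    return " ".join(text.lower().split())


def shingles(text, n=3):
    """Return the set of word n-grams of the normalized text."""
    words = normalize(text).split()
    if len(words) < n:
        return {tuple(words)}
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def deduplicate(docs, threshold=0.8):
    """Keep the first occurrence of each document.

    A document is dropped if its normalized text hashes to a value already
    seen (exact duplicate) or if its shingle set reaches `threshold`
    Jaccard similarity with a kept document (near duplicate).
    The threshold value is a hypothetical choice for illustration.
    """
    seen_hashes = set()
    kept, kept_shingles = [], []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue  # exact duplicate after normalization
        seen_hashes.add(digest)
        doc_shingles = shingles(doc)
        if any(jaccard(doc_shingles, k) >= threshold for k in kept_shingles):
            continue  # near duplicate of a document already kept
        kept.append(doc)
        kept_shingles.append(doc_shingles)
    return kept
```

Calling `deduplicate` on a list of strings returns the surviving documents in their original order. Documents that differ only in case or spacing are caught by the hash check, while documents that differ in a few words are caught by the shingle comparison.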

