MixtureVitae: Open Web-Scale Pretraining Dataset With High Quality Instruction and Reasoning Data Built from Permissive-First Text Sources

Huu Nguyen, Victor May, Harsh Raj, Marianna Nezhurina, Yishan Wang, Yanqi Luo, Minh Chien Vu, Taishi Nakamura, Ken Tsui, Van Khue Nguyen, David Salinas, Aleksandra Krasnodębska, Christoph Schuhmann, Mats Leon Richter, Xuan-Son Vu, Jenia Jitsev

Source: arXiv. Code: https://github.com/ontocord/mixturevitae

We present MixtureVitae, an open-access pretraining corpus built to minimize legal risk while providing strong model performance. MixtureVitae follows a risk-mitigated sourcing strategy that combines public-domain and permissively licensed text (e.g., CC-BY/Apache) with carefully justified low-risk additions (e.g., government works and EU TDM-eligible sources), alongside targeted instruction, reasoning and synthetic data with documented provenance. We detail a transparent, multi-stage pipeline for license-aware filtering, safety and quality screening, and domain-aware mixing, and we release the dataset and curation recipes to support reproducible research. In controlled experiments using the open-sci-ref training protocol (fixed architectures at 130M/400M/1.3B/1.7B parameters; training budgets of 50B and 300B tokens), models trained on MixtureVitae consistently outperform other permissive datasets across a suite of standard benchmarks, and at the 1.7B/300B setting they surpass FineWeb-Edu and approach DCLM in the later stages of training. Performance is particularly strong on math/code and competitive on QA tasks. These results demonstrate that permissive-first, risk-mitigated data provides a practical and legally mitigated foundation for training capable LLMs, reducing reliance on indiscriminate web scraping without sacrificing competitiveness. Code: https://github.com/ontocord/mixturevitae
