We present MixtureVitae, an open-access pretraining corpus built to minimize legal risk while providing strong downstream performance. MixtureVitae follows a permissive-first, risk-mitigated sourcing strategy that combines public-domain and permissively licensed text (e.g., CC-BY/Apache) with carefully justified low-risk additions (e.g., government works and EU TDM-eligible sources). MixtureVitae adopts a simple, single-stage pretraining recipe that integrates a large proportion of permissively licensed synthetic instruction and reasoning data: signals that are typically introduced during post-training and are generally scarce in permissive web corpora. We categorize all sources into a three-tier scheme reflecting varying levels of risk and provide shard-level provenance metadata to enable risk-aware usage. In controlled experiments using the open-sci-ref training protocol (fixed architectures and hyperparameters; 50B and 300B token budgets across models from 130M to 1.7B parameters), models trained on MixtureVitae consistently outperform models trained on other permissive datasets across a suite of standard benchmarks, and at the 1.7B-parameter/300B-token setting, they surpass FineWeb-Edu and approach DCLM late in training. Performance is particularly strong on MMLU and on math and code benchmarks: a 1.7B model pretrained on 300B MixtureVitae tokens matches or exceeds a strong 1.7B instruction-tuned baseline on GSM8K, HumanEval, and MBPP, despite being trained on over 36 times fewer tokens (300B vs. ~11T). Supported by a thorough decontamination analysis, these results show that permissive-first data with high instruction and reasoning density, tiered by licensing and provenance-related risk, can provide a practical and risk-mitigated foundation for training capable LLMs, reducing reliance on broad web scrapes without sacrificing competitiveness. Code: https://github.com/ontocord/mixturevitae
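To illustrate how shard-level provenance metadata can support risk-aware usage, the following minimal Python sketch filters a shard manifest by risk tier. The manifest filename and the field names ("tier", "license", "path") are illustrative assumptions for this example, not the released schema.

```python
# Hypothetical sketch: select only shards whose risk tier is acceptable.
# Assumes a JSON-lines manifest with one metadata record per shard; the
# field names used here are assumptions, not the actual release format.
import json
from typing import Iterator

ALLOWED_TIERS = {1, 2}  # e.g., keep only the lower-risk tiers


def iter_allowed_shards(manifest_path: str) -> Iterator[dict]:
    """Yield shard records whose 'tier' field is in ALLOWED_TIERS."""
    with open(manifest_path) as f:
        for line in f:
            record = json.loads(line)  # one JSON record per shard
            if record.get("tier") in ALLOWED_TIERS:
                yield record


if __name__ == "__main__":
    # Print the path and license of each shard that passes the risk filter.
    for shard in iter_allowed_shards("mixturevitae_manifest.jsonl"):
        print(shard["path"], shard.get("license"))
```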