We present MixtureVitae, an open-access pretraining corpus built to minimize legal risk while providing strong model performance. MixtureVitae follows a risk-mitigated sourcing strategy that combines public-domain and permissively licensed text (e.g., CC-BY/Apache) with carefully justified low-risk additions (e.g., government works and EU TDM-eligible sources), alongside targeted instruction, reasoning, and synthetic data with documented provenance. We detail a transparent, multi-stage pipeline for license-aware filtering, safety and quality screening, and domain-aware mixing, and we release the dataset and curation recipes to support reproducible research. In controlled experiments using the open-sci-ref training protocol (fixed architectures at 130M/400M/1.3B/1.7B parameters; training budgets of 50B and 300B tokens), models trained on MixtureVitae consistently outperform models trained on other permissively sourced datasets across a suite of standard benchmarks, and at the 1.7B/300B setting they surpass models trained on FineWeb-Edu and approach those trained on DCLM in the later stages of training. Performance is particularly strong on math/code and competitive on QA tasks. These results demonstrate that permissive-first, risk-mitigated data provides a practical foundation for training capable LLMs with reduced legal risk, lessening reliance on indiscriminate web scraping without sacrificing competitiveness. Code: https://github.com/ontocord/mixturevitae