MuRating: A High Quality Data Selecting Approach to Multilingual Large Language Model Pretraining

Pretraining · Data selection · Language models · High-quality data · Data quality

December 30, 2025

Zhixun Chen, Ping Guo, Wenhan Han, Yifan Zhang, Binbin Liu, Haobin Lin, Fengze Liu, Yan Zhao, Bingni Zhang, Taifeng Wang, Yin Zheng, Meng Fang

From arXiv; NeurIPS 2025 poster

Data quality is a critical driver of large language model performance, yet existing model-based selection methods focus almost exclusively on English. We introduce MuRating, a scalable framework that transfers high-quality English data-quality signals into a single rater for 17 target languages. MuRating aggregates multiple English "raters" via pairwise comparisons to learn unified document-quality scores, then projects these judgments through translation to train a multilingual evaluator on monolingual, cross-lingual, and parallel text pairs. Applied to web data, MuRating selects balanced subsets of English and multilingual content to pretrain a 1.2B-parameter LLaMA model. Compared to strong baselines such as QuRater, AskLLM, and DCLM, our approach boosts average accuracy on both English benchmarks and multilingual evaluations, with especially large gains on knowledge-intensive tasks. We further analyze translation fidelity, selection biases, and the underrepresentation of narrative material, outlining directions for future work.
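The abstract's first step, merging several raters through pairwise comparisons into one quality score per document, can be illustrated with a small sketch. It assumes a Bradley-Terry model fitted by gradient descent, which is one common way to turn pairwise preferences into scalar scores; the abstract does not state that MuRating uses exactly this formulation. The function names, toy rater scores, and hyperparameters are hypothetical, and the translation-based multilingual step is not covered.

```python
import math


def aggregate_pairwise(rater_scores, pairs):
    """Convert several raters' scores into soft pairwise preferences.

    rater_scores: list of dicts mapping doc_id -> score (one dict per rater;
                  raters may use different scales, only the ordering is used).
    pairs: list of (a, b) doc_id tuples.
    Returns (a, b, p) triples, where p is the fraction of raters that
    prefer a over b (a tie counts as half a vote).
    """
    prefs = []
    for a, b in pairs:
        votes = 0.0
        for scores in rater_scores:
            if scores[a] > scores[b]:
                votes += 1.0
            elif scores[a] == scores[b]:
                votes += 0.5
        prefs.append((a, b, votes / len(rater_scores)))
    return prefs


def fit_bradley_terry(prefs, doc_ids, lr=0.5, epochs=500):
    """Fit one scalar score per document (assumed Bradley-Terry model).

    Minimizes the cross-entropy between sigmoid(s_a - s_b) and the
    aggregated preference p using plain batch gradient descent.
    """
    s = {d: 0.0 for d in doc_ids}
    for _ in range(epochs):
        grad = {d: 0.0 for d in doc_ids}
        for a, b, p in prefs:
            pred = 1.0 / (1.0 + math.exp(-(s[a] - s[b])))
            g = pred - p  # derivative of the loss w.r.t. (s_a - s_b)
            grad[a] += g
            grad[b] -= g
        for d in doc_ids:
            s[d] -= lr * grad[d] / len(prefs)
    return s


# Toy data: three hypothetical raters scoring three documents.
raters = [
    {"d1": 0.9, "d2": 0.4, "d3": 0.1},
    {"d1": 3.0, "d2": 2.0, "d3": 1.0},
    {"d1": 0.6, "d2": 0.7, "d3": 0.2},  # this rater disagrees on d1 vs d2
]
docs = ["d1", "d2", "d3"]
pairs = [("d1", "d2"), ("d1", "d3"), ("d2", "d3")]

prefs = aggregate_pairwise(raters, pairs)
scores = fit_bradley_terry(prefs, docs)
ranking = sorted(docs, key=scores.get, reverse=True)
print(ranking)
```

Because only the ordering within each pair is used, raters with incompatible score scales can still be combined. A selection step would then keep the highest-scoring documents; how the paper balances that selection across languages is not specified in the abstract.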
