在多语言教学培训中,神经机能翻译到语言平衡方面是如何强健的? (How Robust is Neural Machine Translation to Language Imbalance in Multilingual Tokenizer Training?) - 专知论文

会员服务 ·

0

Performer · 词元分析器 · 稳健性 · Machine Translation · 样本 ·

2022 年 9 月 11 日

How Robust is Neural Machine Translation to Language Imbalance in Multilingual Tokenizer Training?

翻译：在多语言教学培训中,神经机能翻译到语言平衡方面是如何强健的?

Shiyue Zhang,Vishrav Chaudhary,Naman Goyal,James Cross,Guillaume Wenzek,Mohit Bansal,Francisco Guzman

from arxiv, AMTA 2022

A multilingual tokenizer is a fundamental component of multilingual neural machine translation. It is trained from a multilingual corpus. Since a skewed data distribution is considered to be harmful, a sampling strategy is usually used to balance languages in the corpus. However, few works have systematically answered how language imbalance in tokenizer training affects downstream performance. In this work, we analyze how translation performance changes as the data ratios among languages vary in the tokenizer training corpus. We find that while relatively better performance is often observed when languages are more equally sampled, the downstream performance is more robust to language imbalance than we usually expected. Two features, UNK rate and closeness to the character level, can warn of poor downstream performance before performing the task. We also distinguish language sampling for tokenizer training from sampling for model training and show that the model is more sensitive to the latter.

翻译：多语种象征器是多语种神经机器翻译的一个基本组成部分。它从多语种材料中接受培训。由于偏斜的数据分配被认为是有害的,通常使用抽样战略来平衡文体中的语言。然而,很少有工作系统地解答了象征器培训中语言不平衡如何影响下游业绩。在这项工作中,我们分析翻译性能如何变化,因为各语言的数据比率在代言器培训中各不相同。我们发现,虽然在语言抽样比较平等的情况下,表现相对好,但下游的性能比我们通常预期的要强得多。两个特征,即UNK费率和接近字性水平,可以在执行任务前警告下游表现不佳。我们还区分代言培训的语言抽样和示范培训的抽样,并表明该模式对后者更为敏感。

0

相关内容

Performer

高效可扩展图神经网络的研究进展，Recent Advances in Efficient and Scalable Graph Neural Networks

高效可扩展图神经网络的研究进展，Recent Advances in Efficient and Scalable Graph Neural Networks

专知会员服务

78+阅读 · 2022年3月15日

Linux导论，Introduction to Linux，96页ppt

Linux导论，Introduction to Linux，96页ppt

专知会员服务

81+阅读 · 2020年7月26日

【斯坦福】机器学习优化简明导论， Introduction to Optimization for Machine Learning

【斯坦福】机器学习优化简明导论， Introduction to Optimization for Machine Learning

专知会员服务

93+阅读 · 2020年5月6日

【新书：机器学习简介】《A Concise Introduction to Machine Learning》by A.C. Faul (CRC 2019)

【新书：机器学习简介】《A Concise Introduction to Machine Learning》by A.C. Faul (CRC 2019)

专知会员服务

77+阅读 · 2020年2月8日

【跨语言BERT模型大集合】Transfer learning is increasingly going multilingual with language-specific BERT models

专知会员服务

54+阅读 · 2020年1月30日

Aspect-Oriented Syntax Network for Aspect-Based Sentiment Analysis，中山大学数据科学与计算机学院权小军教授，第八届全国社会媒体处理大会SMP2019

Aspect-Oriented Syntax Network for Aspect-Based Sentiment Analysis，中山大学数据科学与计算机学院权小军教授，第八届全国社会媒体处理大会SMP2019

专知会员服务

19+阅读 · 2019年10月22日

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

专知会员服务

49+阅读 · 2019年10月17日

Deep Learning Based Detection and Correction of Cardiac MR Motion Artefacts During Reconstruction for High-Quality Segmentation

Deep Learning Based Detection and Correction of Cardiac MR Motion Artefacts During Reconstruction for High-Quality Segmentation

专知会员服务

59+阅读 · 2019年10月17日

强化学习最新教程，17页pdf

强化学习最新教程，17页pdf

专知会员服务

182+阅读 · 2019年10月11日

【哈佛大学商学院课程Fall 2019】机器学习可解释性

【哈佛大学商学院课程Fall 2019】机器学习可解释性

专知会员服务

105+阅读 · 2019年10月9日

ACM MM 2022 Call for Papers

ACM MM 2022 Call for Papers

CCF多媒体专委会

5+阅读 · 2022年3月29日

【ICIG2021】Check out the hot new trailer of ICIG2021 Symposium9

【ICIG2021】Check out the hot new trailer of ICIG2021 Symposium9

中国图象图形学学会CSIG

0+阅读 · 2021年12月17日

【ICIG2021】Check out the hot new trailer of ICIG2021 Symposium8

【ICIG2021】Check out the hot new trailer of ICIG2021 Symposium8

中国图象图形学学会CSIG

0+阅读 · 2021年11月16日

Hierarchically Structured Meta-learning

Hierarchically Structured Meta-learning

CreateAMind

27+阅读 · 2019年5月22日

Transferring Knowledge across Learning Processes

Transferring Knowledge across Learning Processes

CreateAMind

29+阅读 · 2019年5月18日

Unsupervised Learning via Meta-Learning

Unsupervised Learning via Meta-Learning

CreateAMind

43+阅读 · 2019年1月3日

A Technical Overview of AI & ML in 2018 & Trends for 2019

A Technical Overview of AI & ML in 2018 & Trends for 2019

待字闺中

18+阅读 · 2018年12月24日

pytorch-pretrained-BERT：BERT PyTorch实现，可加载Google BERT预训练模型

pytorch-pretrained-BERT：BERT PyTorch实现，可加载Google BERT预训练模型

AINLP

35+阅读 · 2018年11月6日

【论文推荐】最新十篇机器翻译相关论文—自然语言推理、无监督神经机器翻译、多任务学习、局部卷积、图卷积、多语种机器翻译

【论文推荐】最新十篇机器翻译相关论文—自然语言推理、无监督神经机器翻译、多任务学习、局部卷积、图卷积、多语种机器翻译

专知

15+阅读 · 2018年5月1日

【论文】图上的表示学习综述

【论文】图上的表示学习综述

机器学习研究会

15+阅读 · 2017年9月24日

靶向抑制 MNK-eIF4E 轴增效TRAIL治疗鼻咽癌的机制研究

国家自然科学基金

0+阅读 · 2014年12月31日

雄激素受体在膀胱癌进展中对GATA3的调控机制研究

国家自然科学基金

0+阅读 · 2013年12月31日

HIPIMS制备环境自适应C/MoSx复合润滑薄膜的界面结构调控及摩擦机理

国家自然科学基金

0+阅读 · 2012年12月31日

N-3PUFA降低肥胖型胰岛素抵抗大鼠血脂及提高胰岛素敏感性的机理研究

国家自然科学基金

0+阅读 · 2012年12月31日

第二相金属纳米颗粒掺杂对SrTiO3体系热电性能的调控研究

国家自然科学基金

0+阅读 · 2012年12月31日

LIMK1：罗格列酮抑制人胃癌细胞增殖、迁移及侵袭的作用靶点

国家自然科学基金

0+阅读 · 2012年12月31日

孔结构可控介孔沸石的合成及性能研究

国家自然科学基金

0+阅读 · 2012年12月31日

Pim-3促进自噬对脓毒血症所致肾小管上皮细胞损伤的保护作用

国家自然科学基金

0+阅读 · 2012年12月31日

RGM与neogenin信号调控应激性精神障碍-PTSD杏仁核、海马神经细胞凋亡的分子机制

国家自然科学基金

0+阅读 · 2011年12月31日

基于本体的Deep Web搜索技术

国家自然科学基金

2+阅读 · 2009年12月31日

Practical Real Video Denoising with Realistic Degradation Model

Arxiv

0+阅读 · 2022年10月21日

Is Encoder-Decoder Redundant for Neural Machine Translation?

Arxiv

0+阅读 · 2022年10月21日

Can Domains Be Transferred Across Languages in Multi-Domain Multilingual Neural Machine Translation?

Arxiv

0+阅读 · 2022年10月20日

What Do Compressed Multilingual Machine Translation Models Forget?

Arxiv

0+阅读 · 2022年10月20日

Machine Learning for $K$-adaptability in Two-stage Robust Optimization

Arxiv

0+阅读 · 2022年10月20日

Palm up: Playing in the Latent Manifold for Unsupervised Pretraining

Arxiv

0+阅读 · 2022年10月19日

Language Does More Than Describe: On The Lack Of Figurative Speech in Text-To-Image Models

Arxiv

0+阅读 · 2022年10月19日

A Linguistic Investigation of Machine Learning based Contradiction Detection Models: An Empirical Analysis and Future Perspectives

Arxiv

0+阅读 · 2022年10月19日

Unifying the Convergences in Multilingual Neural Machine Translation

Arxiv

0+阅读 · 2022年10月19日

Meta-Learning to Cluster

Meta-Learning to Cluster

Arxiv

17+阅读 · 2019年10月30日

VIP会员

文章信息

相关主题

词元分析器

Machine Translation

相关VIP内容

高效可扩展图神经网络的研究进展，Recent Advances in Efficient and Scalable Graph Neural Networks

高效可扩展图神经网络的研究进展，Recent Advances in Efficient and Scalable Graph Neural Networks

专知会员服务

78+阅读 · 2022年3月15日

Linux导论，Introduction to Linux，96页ppt

Linux导论，Introduction to Linux，96页ppt

专知会员服务

81+阅读 · 2020年7月26日

【斯坦福】机器学习优化简明导论， Introduction to Optimization for Machine Learning

【斯坦福】机器学习优化简明导论， Introduction to Optimization for Machine Learning

专知会员服务

93+阅读 · 2020年5月6日

【新书：机器学习简介】《A Concise Introduction to Machine Learning》by A.C. Faul (CRC 2019)

【新书：机器学习简介】《A Concise Introduction to Machine Learning》by A.C. Faul (CRC 2019)

专知会员服务

77+阅读 · 2020年2月8日

【跨语言BERT模型大集合】Transfer learning is increasingly going multilingual with language-specific BERT models

专知会员服务

54+阅读 · 2020年1月30日

Aspect-Oriented Syntax Network for Aspect-Based Sentiment Analysis，中山大学数据科学与计算机学院权小军教授，第八届全国社会媒体处理大会SMP2019

Aspect-Oriented Syntax Network for Aspect-Based Sentiment Analysis，中山大学数据科学与计算机学院权小军教授，第八届全国社会媒体处理大会SMP2019

专知会员服务

19+阅读 · 2019年10月22日

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

专知会员服务

49+阅读 · 2019年10月17日

Deep Learning Based Detection and Correction of Cardiac MR Motion Artefacts During Reconstruction for High-Quality Segmentation

Deep Learning Based Detection and Correction of Cardiac MR Motion Artefacts During Reconstruction for High-Quality Segmentation

专知会员服务

59+阅读 · 2019年10月17日

强化学习最新教程，17页pdf

强化学习最新教程，17页pdf

专知会员服务

182+阅读 · 2019年10月11日

【哈佛大学商学院课程Fall 2019】机器学习可解释性

【哈佛大学商学院课程Fall 2019】机器学习可解释性

专知会员服务

105+阅读 · 2019年10月9日

热门VIP内容

开通专知VIP会员享更多权益服务

【ICCV2025教程】基础模型遇见具身智能体

军事机器学习设计：关于开发自动化任务摘要系统的梯次化设计科学研究 | 2025最新93页

扩散模型中的缓存方法综述：迈向高效的多模态生成

【ICCV2025教程】《迈向视觉语言模型的全面推理》

相关资讯

ACM MM 2022 Call for Papers

ACM MM 2022 Call for Papers

CCF多媒体专委会

5+阅读 · 2022年3月29日

【ICIG2021】Check out the hot new trailer of ICIG2021 Symposium9

【ICIG2021】Check out the hot new trailer of ICIG2021 Symposium9

中国图象图形学学会CSIG

0+阅读 · 2021年12月17日

【ICIG2021】Check out the hot new trailer of ICIG2021 Symposium8

【ICIG2021】Check out the hot new trailer of ICIG2021 Symposium8

中国图象图形学学会CSIG

0+阅读 · 2021年11月16日

Hierarchically Structured Meta-learning

Hierarchically Structured Meta-learning

CreateAMind

27+阅读 · 2019年5月22日

Transferring Knowledge across Learning Processes

Transferring Knowledge across Learning Processes

CreateAMind

29+阅读 · 2019年5月18日

Unsupervised Learning via Meta-Learning

Unsupervised Learning via Meta-Learning

CreateAMind

43+阅读 · 2019年1月3日

A Technical Overview of AI & ML in 2018 & Trends for 2019

A Technical Overview of AI & ML in 2018 & Trends for 2019

待字闺中

18+阅读 · 2018年12月24日

pytorch-pretrained-BERT：BERT PyTorch实现，可加载Google BERT预训练模型

pytorch-pretrained-BERT：BERT PyTorch实现，可加载Google BERT预训练模型

AINLP

35+阅读 · 2018年11月6日

【论文推荐】最新十篇机器翻译相关论文—自然语言推理、无监督神经机器翻译、多任务学习、局部卷积、图卷积、多语种机器翻译

【论文推荐】最新十篇机器翻译相关论文—自然语言推理、无监督神经机器翻译、多任务学习、局部卷积、图卷积、多语种机器翻译

专知

15+阅读 · 2018年5月1日

【论文】图上的表示学习综述

【论文】图上的表示学习综述

机器学习研究会

15+阅读 · 2017年9月24日

相关论文

Practical Real Video Denoising with Realistic Degradation Model

Arxiv

0+阅读 · 2022年10月21日

Is Encoder-Decoder Redundant for Neural Machine Translation?

Arxiv

0+阅读 · 2022年10月21日

Can Domains Be Transferred Across Languages in Multi-Domain Multilingual Neural Machine Translation?

Arxiv

0+阅读 · 2022年10月20日

What Do Compressed Multilingual Machine Translation Models Forget?

Arxiv

0+阅读 · 2022年10月20日

Machine Learning for $K$-adaptability in Two-stage Robust Optimization

Arxiv

0+阅读 · 2022年10月20日

Palm up: Playing in the Latent Manifold for Unsupervised Pretraining

Arxiv

0+阅读 · 2022年10月19日

Language Does More Than Describe: On The Lack Of Figurative Speech in Text-To-Image Models

Arxiv

0+阅读 · 2022年10月19日

A Linguistic Investigation of Machine Learning based Contradiction Detection Models: An Empirical Analysis and Future Perspectives

Arxiv

0+阅读 · 2022年10月19日

Unifying the Convergences in Multilingual Neural Machine Translation

Arxiv

0+阅读 · 2022年10月19日

Meta-Learning to Cluster

Meta-Learning to Cluster

Arxiv

17+阅读 · 2019年10月30日

相关基金

靶向抑制 MNK-eIF4E 轴增效TRAIL治疗鼻咽癌的机制研究

国家自然科学基金

0+阅读 · 2014年12月31日

雄激素受体在膀胱癌进展中对GATA3的调控机制研究

国家自然科学基金

0+阅读 · 2013年12月31日

HIPIMS制备环境自适应C/MoSx复合润滑薄膜的界面结构调控及摩擦机理

国家自然科学基金

0+阅读 · 2012年12月31日

N-3PUFA降低肥胖型胰岛素抵抗大鼠血脂及提高胰岛素敏感性的机理研究

国家自然科学基金

0+阅读 · 2012年12月31日

第二相金属纳米颗粒掺杂对SrTiO3体系热电性能的调控研究

国家自然科学基金

0+阅读 · 2012年12月31日

LIMK1：罗格列酮抑制人胃癌细胞增殖、迁移及侵袭的作用靶点

国家自然科学基金

0+阅读 · 2012年12月31日

孔结构可控介孔沸石的合成及性能研究

国家自然科学基金

0+阅读 · 2012年12月31日

Pim-3促进自噬对脓毒血症所致肾小管上皮细胞损伤的保护作用

国家自然科学基金

0+阅读 · 2012年12月31日

RGM与neogenin信号调控应激性精神障碍-PTSD杏仁核、海马神经细胞凋亡的分子机制

国家自然科学基金

0+阅读 · 2011年12月31日

基于本体的Deep Web搜索技术

国家自然科学基金

2+阅读 · 2009年12月31日

微信扫码咨询专知VIP会员