In this paper, we introduce DOCmT5, a multilingual sequence-to-sequence language model pre-trained with large-scale parallel documents. While previous approaches have focused on leveraging sentence-level parallel data, we aim to build a general-purpose pre-trained model that can understand and generate long documents. We propose a simple and effective pre-training objective, Document Reordering Machine Translation (DrMT), in which the shuffled and masked input documents must be translated. DrMT brings consistent improvements over strong baselines on a variety of document-level generation tasks, including over 12 BLEU points for seen-language-pair document-level MT, over 7 BLEU points for unseen-language-pair document-level MT, and over 3 ROUGE-1 points for seen-language-pair cross-lingual summarization. We achieve state-of-the-art (SOTA) results on the WMT20 De-En and IWSLT15 Zh-En document translation tasks. We also conduct extensive analysis of various factors in document pre-training, including (1) the effects of pre-training data quality and (2) the effects of combining monolingual and cross-lingual pre-training. We plan to make our model checkpoints publicly available.
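To make the DrMT objective concrete, the following is a minimal sketch of how a single training example could be constructed from a parallel document pair, based only on the description above: the source document's sentences are shuffled and partially masked, and the target is the translated document in its original order. The function name, mask token, and masking/shuffling rates are illustrative assumptions, not the paper's actual implementation.

```python
import random

MASK_TOKEN = "<mask>"  # assumed sentinel token, not necessarily the one used in the paper


def make_drmt_example(src_sentences, tgt_sentences, mask_rate=0.15, seed=None):
    """Sketch of building one (input, target) pair for a DrMT-style objective.

    src_sentences: list of source-language sentences forming one document.
    tgt_sentences: the corresponding translated sentences, in original order.
    """
    rng = random.Random(seed)

    # 1) Shuffle the source sentences so the model must also recover document order.
    shuffled = src_sentences[:]
    rng.shuffle(shuffled)

    # 2) Mask a fraction of the shuffled sentences.
    noised = [MASK_TOKEN if rng.random() < mask_rate else sent for sent in shuffled]

    # 3) The target is the full translated document in the original order.
    model_input = " ".join(noised)
    model_target = " ".join(tgt_sentences)
    return model_input, model_target


# Toy usage with a three-sentence German->English document pair.
src = ["Satz eins.", "Satz zwei.", "Satz drei."]
tgt = ["Sentence one.", "Sentence two.", "Sentence three."]
print(make_drmt_example(src, tgt, seed=0))
```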