In this paper, we introduce DOCmT5, a multilingual sequence-to-sequence language model pre-trained with large-scale parallel documents. While previous approaches have focused on leveraging sentence-level parallel data, we aim to build a general-purpose pre-trained model that can understand and generate long documents. We propose a simple and effective pre-training objective, Document Reordering Machine Translation (DrMT), in which the shuffled and masked input documents must be translated. DrMT brings consistent improvements over strong baselines on a variety of document-level generation tasks, including over 12 BLEU points for seen-language-pair document-level MT, over 7 BLEU points for unseen-language-pair document-level MT, and over 3 ROUGE-1 points for seen-language-pair cross-lingual summarization. We achieve state-of-the-art (SOTA) results on the WMT20 De-En and IWSLT15 Zh-En document translation tasks. We also conduct extensive analysis of various factors in document pre-training, including (1) the effects of pre-training data quality and (2) the effects of combining monolingual and cross-lingual pre-training. We plan to make our model checkpoints publicly available.
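To make the DrMT objective concrete, the following is a minimal sketch of how a single training example could be constructed from a parallel document pair, based only on the description above: the source document's sentences are shuffled and partially masked, and the target is the translated document in its original order. The function name, mask token, and masking/shuffling rates are illustrative assumptions, not the paper's actual implementation.

```python
import random

MASK_TOKEN = "<mask>"  # assumed sentinel token, not necessarily the one used in the paper


def make_drmt_example(src_sentences, tgt_sentences, mask_rate=0.15, seed=None):
    """Sketch of building one (input, target) pair for a DrMT-style objective.

    src_sentences: list of source-language sentences forming one document.
    tgt_sentences: the corresponding translated sentences, in original order.
    """
    rng = random.Random(seed)

    # 1) Shuffle the source sentences so the model must also recover document order.
    shuffled = src_sentences[:]
    rng.shuffle(shuffled)

    # 2) Mask a fraction of the shuffled sentences.
    noised = [MASK_TOKEN if rng.random() < mask_rate else sent for sent in shuffled]

    # 3) The target is the full translated document in the original order.
    model_input = " ".join(noised)
    model_target = " ".join(tgt_sentences)
    return model_input, model_target


# Toy usage with a three-sentence German->English document pair.
src = ["Satz eins.", "Satz zwei.", "Satz drei."]
tgt = ["Sentence one.", "Sentence two.", "Sentence three."]
print(make_drmt_example(src, tgt, seed=0))
```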