Document-level neural machine translation (NMT) has outperformed sentence-level NMT on a number of datasets. However, document-level NMT is still not widely adopted in real-world translation systems, mainly due to the lack of large-scale general-domain training data for document-level NMT. We examine the effectiveness of using Paracrawl for learning document-level translation. Paracrawl is a large-scale parallel corpus crawled from the Internet that contains data from various domains. The official Paracrawl corpus was released as parallel sentences (extracted from parallel webpages), and therefore previous work used Paracrawl only for learning sentence-level translation. In this work, we extract parallel paragraphs from Paracrawl parallel webpages using automatic sentence alignments, and we use the extracted parallel paragraphs as parallel documents for training document-level translation models. We show that document-level NMT models trained only on parallel paragraphs from Paracrawl can translate real documents from TED, News, and Europarl, outperforming sentence-level NMT models. We also perform a targeted pronoun evaluation and show that document-level models trained with Paracrawl data can improve context-aware pronoun translation.
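The paragraph-extraction step can be pictured as grouping runs of consecutive sentence alignments into contiguous blocks, one block per parallel paragraph. The sketch below is a minimal illustration of that idea, not the paper's actual pipeline; the function name, the 1-1 alignment format, and the `min_len` threshold are all assumptions introduced here.

```python
# Hypothetical sketch: given automatic 1-1 sentence alignments between a
# source and target webpage, keep maximal runs of consecutive alignments
# and treat each run as one parallel paragraph. Names and thresholds are
# illustrative, not taken from the paper.

def extract_parallel_paragraphs(alignments, src_sents, tgt_sents, min_len=2):
    """alignments: sorted list of (src_idx, tgt_idx) sentence-alignment pairs.

    Returns a list of (source_paragraph, target_paragraph) string pairs,
    keeping only runs of at least `min_len` aligned sentences.
    """
    blocks, block = [], []
    for src_i, tgt_i in alignments:
        # Start a new block whenever the alignment is not contiguous with
        # the previous one on both sides.
        if block and (src_i != block[-1][0] + 1 or tgt_i != block[-1][1] + 1):
            if len(block) >= min_len:
                blocks.append(block)
            block = []
        block.append((src_i, tgt_i))
    if len(block) >= min_len:
        blocks.append(block)

    return [
        (" ".join(src_sents[i] for i, _ in blk),
         " ".join(tgt_sents[j] for _, j in blk))
        for blk in blocks
    ]


# Toy usage: sentences 0-1 form a contiguous aligned run, sentence 3 is an
# isolated alignment and is dropped by the min_len filter.
paras = extract_parallel_paragraphs(
    [(0, 0), (1, 1), (3, 3)],
    ["a", "b", "c", "d"],
    ["A", "B", "C", "D"],
)
```

In practice the alignments would come from a sentence-alignment tool run over each pair of parallel webpages; the resulting paragraph pairs then serve as short parallel "documents" for context-aware training.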