Domain-Specific Text Generation for Machine Translation

Preserving domain knowledge from source to target is crucial in any translation workflow. The translation industry often receives highly specialized projects for which hardly any parallel in-domain data exists. When there is too little in-domain data to fine-tune Machine Translation (MT) models, producing translations that are consistent with the relevant context is challenging. In this work, we propose a novel approach to domain adaptation that leverages state-of-the-art pretrained language models (LMs) for domain-specific data augmentation for MT, simulating the domain characteristics of either (a) a small bilingual dataset, or (b) the monolingual source text to be translated. Combining this idea with back-translation, we can generate huge amounts of synthetic bilingual in-domain data for both use cases. For our investigation, we use the state-of-the-art Transformer architecture. We employ mixed fine-tuning to train models that significantly improve translation of in-domain texts. More specifically, in both scenarios, our proposed methods achieve improvements of approximately 5-6 BLEU on the Arabic-to-English language pair and 2-3 BLEU on the English-to-Arabic language pair. Furthermore, the outcome of human evaluation corroborates the automatic evaluation results.
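
As a rough illustration of the pipeline the abstract outlines, the Python sketch below generates target-side in-domain text from prompts, back-translates it to obtain synthetic source sentences, and then mixes the synthetic pairs with generic data for fine-tuning. It is a minimal sketch under stated assumptions, not the authors' implementation: `generate` and `back_translate` are hypothetical callables standing in for a pretrained LM and a target-to-source MT model, and oversampling the in-domain pairs to match the generic data is one common way to set up mixed fine-tuning that the abstract does not spell out.

```python
import random
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (source sentence, target sentence)


def synthesize_pairs(
    prompts: List[str],
    generate: Callable[[str, int], List[str]],  # hypothetical LM wrapper
    back_translate: Callable[[str], str],       # hypothetical target->source MT
    per_prompt: int = 2,
) -> List[Pair]:
    """Build synthetic in-domain pairs from target-language prompts."""
    pairs: List[Pair] = []
    for prompt in prompts:
        # The LM continues each in-domain prompt with new target-side text.
        for target in generate(prompt, per_prompt):
            # Back-translation supplies the matching synthetic source side.
            pairs.append((back_translate(target), target))
    return pairs


def mix_for_fine_tuning(
    generic: List[Pair], in_domain: List[Pair], seed: int = 0
) -> List[Pair]:
    """Oversample in-domain pairs to the generic size, then shuffle both.

    Assumption: a 1:1 generic to in-domain ratio; other ratios are possible.
    """
    if not in_domain:
        raise ValueError("in_domain must contain at least one pair")
    size = max(len(generic), len(in_domain))
    repeats = -(-size // len(in_domain))  # ceiling division
    oversampled = (in_domain * repeats)[:size]
    mixed = generic + oversampled
    random.Random(seed).shuffle(mixed)  # seeded for reproducibility
    return mixed
```

In use, `prompts` would come from the target side of the small bilingual dataset in scenario (a); for scenario (b), one assumed option is to first translate the monolingual source text with a baseline MT model and use those translations as prompts.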