Prior work has demonstrated that a low-resource language pair can benefit from multilingual machine translation (MT) systems, which rely on the joint training of many language pairs. This paper proposes two simple strategies to address the rare-word issue in multilingual MT systems for two low-resource language pairs: French-Vietnamese and English-Vietnamese. The first strategy dynamically learns the similarity of tokens in the space shared among the source languages, while the second attempts to improve the translation of rare words by updating their embeddings during training. In addition, we leverage monolingual data in multilingual MT systems to increase the amount of synthetic parallel data and thereby mitigate the data sparsity problem. We show significant improvements of up to +1.62 and +2.54 BLEU points over the bilingual baseline systems for both language pairs, and we release our datasets for the research community.
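To make the notion of token similarity in a shared space concrete, the following is a minimal sketch, not the method used in this paper. It assumes a hypothetical toy embedding table (`shared_embeddings`, with made-up vectors and language-prefixed token names) and uses plain cosine similarity to find the closest token to a given source token. The helper names `cosine` and `most_similar` are illustrative only.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_u * norm_v)


def most_similar(token, embeddings, top_k=1):
    """Return the top_k (token, score) pairs closest to `token`."""
    query = embeddings[token]
    scored = [
        (other, cosine(query, vec))
        for other, vec in embeddings.items()
        if other != token
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical shared embedding table; the values are invented for illustration.
shared_embeddings = {
    "en:cat": [0.9, 0.1, 0.0],
    "fr:chat": [0.8, 0.2, 0.1],
    "en:house": [0.0, 0.9, 0.3],
    "fr:maison": [0.1, 0.8, 0.4],
}

nearest = most_similar("fr:chat", shared_embeddings)
print(nearest[0][0])
```

In a real system the vectors would come from the model's embedding layer and change as training proceeds; here they are fixed so that the lookup itself is easy to follow.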