The advent of deep learning has led to significant gains in machine translation. However, most studies require large parallel corpora, which are scarce, expensive to construct, and even unavailable for some languages. This paper presents a simple yet effective method that tackles this problem for low-resource languages by augmenting high-quality sentence pairs and training NMT models in a semi-supervised manner. Specifically, our approach combines a cross-entropy loss for supervised learning with a KL-divergence loss for unsupervised learning over pseudo and augmented target sentences derived from the model. We also introduce a SentenceBERT-based filter that enhances the quality of the augmented data by retaining only semantically similar sentence pairs. Experimental results show that our approach significantly improves NMT baselines, especially on low-resource datasets, by 0.46--2.03 BLEU. We also demonstrate that using unsupervised training on the augmented data is more effective than reusing the ground-truth target sentences for supervised learning.
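The abstract does not spell out the combined objective; a minimal sketch of one plausible reading, writing $(x, y)$ for a ground-truth pair, $\hat{y}$ and $\tilde{y}$ for the pseudo and augmented target sentences derived from the model, and $\lambda$ for an assumed weighting coefficient (not named above), is:

\[
\mathcal{L}(\theta) \;=\;
\underbrace{\mathcal{L}_{\mathrm{CE}}\bigl(p_\theta(y \mid x)\bigr)}_{\text{supervised}}
\;+\; \lambda \,
\underbrace{D_{\mathrm{KL}}\bigl(p_\theta(\cdot \mid x, \hat{y}) \,\big\|\, p_\theta(\cdot \mid x, \tilde{y})\bigr)}_{\text{unsupervised}}
\]

where the KL term pushes the model's predictive distributions under the pseudo and augmented targets toward agreement; the exact conditioning and weighting are assumptions of this sketch, not claims about the paper's formulation.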
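To illustrate the filtering step, the sketch below keeps only augmented pairs whose SentenceBERT embeddings are sufficiently similar. It assumes the sentence-transformers library; the model name, the `filter_pairs` helper, and the 0.7 threshold are illustrative choices, not taken from the paper.

```python
# Sketch of a SentenceBERT-based filter for augmented sentence pairs.
# Assumes the sentence-transformers library; the model name and the
# similarity threshold are illustrative, not taken from the paper.
from sentence_transformers import SentenceTransformer, util

# A multilingual model embeds source and target sentences in a shared space.
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

def filter_pairs(pairs, threshold=0.7):
    """Keep (source, target) pairs whose cosine similarity exceeds `threshold`."""
    sources, targets = zip(*pairs)
    src_emb = model.encode(list(sources), convert_to_tensor=True)
    tgt_emb = model.encode(list(targets), convert_to_tensor=True)
    # The diagonal of the similarity matrix scores each aligned pair.
    sims = util.cos_sim(src_emb, tgt_emb).diagonal()
    return [pair for pair, sim in zip(pairs, sims) if sim >= threshold]

# Example: the second pair is a mistranslation and should be dropped.
augmented = [("ein kleines Haus", "a small house"),
             ("guten Morgen", "see you tomorrow")]
print(filter_pairs(augmented))
```

A multilingual encoder is assumed here so that source and target sentences can be compared directly; with a monolingual encoder, one would instead compare the augmented target against the pseudo target in the same language.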