Machine translation systems are vulnerable to domain mismatch, especially in low-resource scenarios. Out-of-domain translations are often of poor quality and prone to hallucination, owing to exposure bias and the decoder acting as a language model. We adopt two approaches to alleviate this problem: lexical shortlisting restricted by IBM statistical alignments, and hypothesis re-ranking based on similarity. The methods are computationally cheap and widely known, but have not been extensively tested for domain adaptation. We demonstrate success on low-resource out-of-domain test sets; however, the methods are ineffective when there is sufficient data or when the domain mismatch is too great. This is due both to the IBM model losing its advantage over the implicitly learned neural alignment, and to issues with subword segmentation of out-of-domain words.
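The second approach, similarity-based hypothesis re-ranking, can be sketched as a consensus re-ranker over an n-best list: each hypothesis is scored by its average similarity to the other hypotheses, so an outlier (e.g. a hallucination) is demoted. Token-level Jaccard similarity is an illustrative stand-in here, not necessarily the similarity measure the paper uses.

```python
def jaccard(a, b):
    """Token-overlap similarity between two sentences (toy measure)."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def rerank(nbest):
    """Return the hypothesis most similar, on average, to the rest of the list."""
    def consensus(hyp):
        return sum(jaccard(hyp, other) for other in nbest if other is not hyp)
    return max(nbest, key=consensus)

nbest = ["the cat sits", "the cat sits down", "a hallucinated sentence"]
print(rerank(nbest))  # the outlier third hypothesis scores lowest
```

Like shortlisting, this only post-processes the decoder's own n-best output, so it adds no training cost and little decoding cost.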