Natural language exhibits a token imbalance phenomenon: different tokens appear with very different frequencies, which leads to different learning difficulties for tokens in Neural Machine Translation (NMT). The vanilla NMT model usually adopts a trivial equal-weighted objective for target tokens regardless of their frequency, and it tends to generate more high-frequency tokens and fewer low-frequency tokens than the gold token distribution. However, low-frequency tokens may carry critical semantic information, and translation quality suffers once they are neglected. In this paper, we explore target token-level adaptive objectives based on token frequencies to assign an appropriate weight to each target token during training. The aim is that meaningful but relatively low-frequency words receive larger weights in the objective, encouraging the model to pay more attention to these tokens. Our method yields consistent improvements in translation quality on ZH-EN, EN-RO, and EN-DE translation tasks, especially on sentences that contain more low-frequency tokens, where we obtain BLEU gains of 1.68, 1.02, and 0.52 over the baseline, respectively. Further analyses show that our method also improves the lexical diversity of translations.
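To make the idea concrete, below is a minimal sketch of a frequency-weighted cross-entropy loss in PyTorch. The weighting function `frequency_based_weights` and its hyperparameters `a` and `t` are illustrative assumptions for exposition, not the exact adaptive weighting formulation proposed in the paper.

```python
import torch
import torch.nn.functional as F


def frequency_based_weights(token_counts: torch.Tensor, a: float = 1.0,
                            t: float = 1e-4) -> torch.Tensor:
    """Map per-token corpus counts to per-token loss weights.

    Rare tokens receive weights above 1.0, while very frequent tokens
    stay near 1.0. The exponential form here is an illustrative choice,
    not the paper's exact weighting function.
    """
    freqs = token_counts.float() / token_counts.sum()
    return 1.0 + a * torch.exp(-freqs / t)


def weighted_token_loss(logits: torch.Tensor, targets: torch.Tensor,
                        weights: torch.Tensor, pad_id: int = 0) -> torch.Tensor:
    """Cross-entropy where each target token is scaled by its own weight.

    logits:  (batch, seq_len, vocab) decoder output scores
    targets: (batch, seq_len) gold target token ids
    weights: (vocab,) per-token weights from frequency_based_weights
    """
    flat_logits = logits.view(-1, logits.size(-1))
    flat_targets = targets.view(-1)
    per_token = F.cross_entropy(flat_logits, flat_targets, reduction="none")
    token_w = weights[flat_targets]          # look up each target token's weight
    mask = (flat_targets != pad_id).float()  # ignore padding positions
    return (per_token * token_w * mask).sum() / mask.sum()
```

Note that `F.cross_entropy` also accepts a per-class `weight` tensor directly; the manual version above simply makes the per-token scaling explicit.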