Homographs, words with the same spelling but different meanings, remain challenging for Neural Machine Translation (NMT). While recent work leverages various word embedding approaches to differentiate word senses in NMT, it does not focus on the component that is pivotal to resolving homograph ambiguity: the hidden states of the encoder. In this paper, we propose a novel approach that tackles homograph ambiguity in NMT in the latent space. We first train an encoder (the "HDR-encoder") to learn universal sentence representations on a natural language inference (NLI) task. We then fine-tune the encoder on homograph-based synset sentences from WordNet, enabling it to learn word-level homographic disambiguation representations (HDR). The pre-trained HDR-encoder is subsequently integrated with a transformer-based NMT model under various schemes to improve translation accuracy. Experiments on four translation directions demonstrate that the proposed method improves the BLEU scores of NMT systems (by up to +2.3 over a strong baseline). The improvement is corroborated by other translation-accuracy metrics (F1, precision, and recall) on an additional disambiguation task. Heatmaps, t-SNE visualizations, and translation examples further illustrate the effects of the proposed method.
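To make the precision, recall, and F1 figures of the disambiguation task concrete, the following is a minimal sketch of how such scores could be computed. The abstract does not define the evaluation protocol, so everything here is an assumption for illustration: the function name `disambiguation_scores`, the input format (a set of acceptable target-side translations for the intended sense, paired with the system's translation of the homograph, or `None` when no translation could be identified), and the choice to count untranslated homographs against recall only.

```python
from typing import Optional, Sequence, Set, Tuple

# Hypothetical helper, not taken from the paper: one example is
# (acceptable translations for the intended sense, system translation or None).
Example = Tuple[Set[str], Optional[str]]


def disambiguation_scores(examples: Sequence[Example]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) for homograph translation.

    Assumed definitions:
      precision = correct / homographs the system translated
      recall    = correct / all homographs in the test set
      F1        = harmonic mean of precision and recall
    """
    # Homographs for which the system produced an identifiable translation.
    attempted = sum(1 for _, hyp in examples if hyp is not None)
    # Translations that match one of the acceptable renderings of the sense.
    correct = sum(1 for gold, hyp in examples if hyp is not None and hyp in gold)

    precision = correct / attempted if attempted else 0.0
    recall = correct / len(examples) if examples else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # Made-up English-to-German examples for "bank" and "bat".
    demo = [
        ({"Ufer"}, "Ufer"),              # river bank, translated correctly
        ({"Bank"}, "Ufer"),              # financial bank, wrong sense chosen
        ({"Schläger"}, None),            # bat (sports), no translation found
        ({"Fledermaus"}, "Fledermaus"),  # bat (animal), translated correctly
    ]
    p, r, f = disambiguation_scores(demo)
    print(f"precision={p:.3f} recall={r:.3f} F1={f:.3f}")
```

Under these assumed definitions, a system that skips hard homographs keeps its precision but loses recall, which is why reporting all three numbers alongside BLEU gives a fuller picture than BLEU alone.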