Exploiting Curriculum Learning in Unsupervised Neural Machine Translation

Back-translation (BT) has become one of the de facto components of unsupervised neural machine translation (UNMT), and it is what explicitly gives UNMT its translation ability. However, all pseudo bi-texts generated by BT are treated equally as clean data during optimization, without regard to how much their quality varies, which leads to slow convergence and limited translation performance. To address this problem, we propose a curriculum learning method that gradually utilizes pseudo bi-texts according to their quality at multiple granularities. Specifically, we first apply cross-lingual word embeddings to estimate the potential translation difficulty (quality) of the monolingual sentences. The sentences are then fed into UNMT from easy to hard, batch by batch. Furthermore, because the quality of sentences and tokens within a particular batch is also diverse, we use the model itself to calculate fine-grained quality scores. These scores serve as learning factors that balance the contributions of different parts when computing the loss and encourage the UNMT model to focus on pseudo data of higher quality. Experimental results on the WMT 14 En-Fr, WMT 16 En-De, WMT 16 En-Ro, and LDC En-Zh translation tasks demonstrate that the proposed method achieves consistent improvements with faster convergence.
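The two granularities described in the abstract can be illustrated with a minimal sketch. The abstract does not give the scoring formulas, so everything here is an assumption for illustration: the difficulty measure (one minus the mean best cosine similarity between each source word and any target word), the function names `sentence_difficulty`, `curriculum_batches`, and `weighted_nll`, and the weight-normalized loss are all hypothetical and are not the authors' implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def sentence_difficulty(tokens, src_emb, tgt_emb):
    """Assumed difficulty score: 1 - mean best similarity to any target word.

    src_emb and tgt_emb are dicts mapping words to vectors that are assumed
    to live in a shared cross-lingual embedding space.
    """
    if not tokens:
        return 1.0
    sims = []
    for tok in tokens:
        vec = src_emb.get(tok)
        if vec is None:
            sims.append(0.0)  # assumption: unknown words count as hardest
        else:
            sims.append(max(cosine(vec, t) for t in tgt_emb.values()))
    return 1.0 - sum(sims) / len(sims)


def curriculum_batches(sentences, src_emb, tgt_emb, batch_size):
    """Order sentences from easy to hard, then split them into batches."""
    ranked = sorted(
        sentences, key=lambda s: sentence_difficulty(s, src_emb, tgt_emb)
    )
    return [ranked[i:i + batch_size] for i in range(0, len(ranked), batch_size)]


def weighted_nll(token_log_probs, quality_weights):
    """Negative log-likelihood where each token is scaled by a quality weight.

    Normalizing by the total weight is an assumption made for this sketch.
    """
    total = sum(quality_weights)
    if total == 0.0:
        return 0.0
    weighted = sum(w * lp for w, lp in zip(quality_weights, token_log_probs))
    return -weighted / total


if __name__ == "__main__":
    # Toy embeddings, invented purely for the demonstration.
    src = {"cat": [1.0, 0.0], "mixed": [1.0, 1.0]}
    tgt = {"chat": [1.0, 0.0], "chien": [0.0, 1.0]}
    data = [["unseen"], ["mixed"], ["cat"]]
    print(curriculum_batches(data, src, tgt, batch_size=2))
    print(weighted_nll([-1.0, -3.0], [1.0, 0.0]))
```

In this toy setup the sentence whose word aligns perfectly with a target word is scheduled first and the out-of-vocabulary sentence last, while a token with zero quality weight contributes nothing to the loss. In an actual UNMT system the weights would come from the model's own quality estimates, as the abstract describes.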

