Existing curriculum learning approaches to Neural Machine Translation (NMT) require sampling sufficient amounts of "easy" samples from the training data at the early training stage. This is not always achievable for low-resource languages, where the amount of training data is limited. To address this limitation, we propose a novel token-wise curriculum learning approach that creates sufficient amounts of easy samples. Specifically, the model learns to predict a short sub-sequence from the beginning of each target sentence at the early stage of training, and the sub-sequence is then gradually expanded as training progresses. This new curriculum design is inspired by the cumulative effect of translation errors, which makes later tokens more difficult to predict than earlier ones. Extensive experiments show that our approach consistently outperforms baselines on 5 language pairs, especially for low-resource languages. Combining our approach with sentence-level methods further improves performance on high-resource languages.
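To make the idea concrete, below is a minimal PyTorch sketch of a token-wise curriculum loss: the cross-entropy is restricted to a prefix of each target sentence, and the prefix fraction grows over training. The function names (`curriculum_fraction`, `token_curriculum_loss`) and the linear growth schedule are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def curriculum_fraction(step, total_steps, start=0.2):
    """Hypothetical schedule: fraction of each target prefix used for the loss,
    growing linearly from `start` to 1.0 over training (an assumed schedule)."""
    return min(1.0, start + (1.0 - start) * step / total_steps)

def token_curriculum_loss(logits, targets, lengths, step, total_steps, pad_id=0):
    """Cross-entropy restricted to a prefix of each target sentence.

    logits:  (batch, max_len, vocab) decoder outputs
    targets: (batch, max_len) gold token ids, padded with pad_id
    lengths: (batch,) true target lengths
    """
    frac = curriculum_fraction(step, total_steps)
    # Number of target tokens kept per sentence (at least 1).
    keep = torch.clamp((lengths.float() * frac).ceil().long(), min=1)

    positions = torch.arange(targets.size(1), device=targets.device).unsqueeze(0)
    # Keep a position if it lies inside the prefix and is not padding.
    mask = (positions < keep.unsqueeze(1)) & (targets != pad_id)

    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        reduction="none",
    ).reshape_as(targets)
    return (loss * mask).sum() / mask.sum().clamp(min=1)
```

Early in training only the first few tokens of each target contribute to the loss, so every sentence yields an "easy" training sample; as `step` approaches `total_steps`, the mask covers the full target and training reduces to the standard objective.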