In sequence prediction tasks such as neural machine translation (NMT), training with the cross-entropy loss often yields models that overgeneralize and become trapped in local optima. In this paper, we propose an extended loss function called \emph{dual skew divergence} (DSD), which integrates two symmetric skew KL-divergence terms through a balancing weight. We empirically found that this balancing weight plays a crucial role in applying the proposed DSD loss to deep models, and we therefore develop a controllable DSD loss for general-purpose scenarios. Our experiments indicate that switching to the DSD loss after maximum-likelihood (ML) training has converged helps models escape local optima and yields stable performance improvements. Evaluations on the WMT 2014 English-German and English-French translation tasks demonstrate that the proposed loss, as a general and convenient means for NMT training, indeed improves performance over strong baselines.
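As a rough sketch of the construction described above (the exact formulation is defined in the paper body; the skew parameter $\alpha$ and balancing weight $\lambda$ used here are our notational assumptions), one plausible instantiation of a dual skew divergence between the model distribution $p$ and the target distribution $q$ is
\[
\mathrm{DSD}_{\alpha,\lambda}(p, q) \;=\; \lambda\, \mathrm{KL}\bigl(p \,\|\, \alpha q + (1-\alpha)\, p\bigr) \;+\; (1-\lambda)\, \mathrm{KL}\bigl(q \,\|\, \alpha p + (1-\alpha)\, q\bigr),
\]
i.e., two skew KL terms that smooth each argument toward the other, combined through the weight $\lambda$ that the abstract identifies as crucial to tune.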