Label smoothing is ubiquitously applied in Neural Machine Translation (NMT) training. While label smoothing offers a desired regularization effect during model training, in this paper we demonstrate that it nevertheless introduces length biases into the beam search decoding procedure. Our analysis shows that label smoothing implicitly applies a length penalty term to the output sequence, causing a bias towards shorter translations. We also show that for a model fully optimized with label smoothing, translation length is implicitly upper bounded by a fixed constant independent of the input. We verify our theory by applying a simple rectification function at inference time to restore the unbiased distributions from the label-smoothed model predictions. This rectification method led to consistent quality improvements on WMT English-German, English-French, English-Czech and English-Chinese tasks, up to +0.3 BLEU at beam size 4 and +2.8 BLEU at beam size 200.
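As an illustration of the rectification idea (a hedged sketch, not necessarily the authors' exact formulation), the snippet below assumes the model's predictive distribution approximates the label-smoothed target, q = (1 - ε)·p + ε/V, and inverts that mixture before scoring beam search hypotheses; the function name `rectify_probs` and the smoothing value ε = 0.1 are illustrative assumptions.

```python
import numpy as np

def rectify_probs(q: np.ndarray, epsilon: float = 0.1) -> np.ndarray:
    """Invert label smoothing on a predicted distribution.

    Assumes the model output q approximates the smoothed target
    q = (1 - eps) * p + eps / V and recovers an estimate of the
    unsmoothed distribution p. The result is clipped and renormalized
    so it remains a valid probability distribution.
    """
    vocab_size = q.shape[-1]
    p = (q - epsilon / vocab_size) / (1.0 - epsilon)
    p = np.clip(p, 1e-12, None)              # guard against small negative values
    return p / p.sum(axis=-1, keepdims=True)

# Toy example with a vocabulary of size 4: the rectified log-probabilities
# would replace the raw model scores inside beam search.
q = np.array([0.70, 0.20, 0.05, 0.05])
print(np.log(rectify_probs(q, epsilon=0.1)))
```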