Minimum Bayes Risk (MBR) decoding has emerged as a promising decoding algorithm in neural machine translation. However, MBR performs poorly with label smoothing, which is surprising, as label smoothing yields solid improvements with beam search and improves generalization across various tasks. In this work, we show that the issue arises from the inconsistency of label smoothing's effect on the token-level and sequence-level distributions. We demonstrate that even though label smoothing causes only a slight change at the token level, the sequence-level distribution becomes highly skewed. We coin this issue \emph{distributional over-smoothness}. To address it, we propose a simple and effective method, Distributional Cooling MBR (DC-MBR), which manipulates the entropy of the output distributions by tuning down the softmax temperature. We theoretically prove the equivalence between pre-tuning the label smoothing factor and distributional cooling. Experiments on NMT benchmarks validate that distributional cooling improves MBR's efficiency and effectiveness across various settings.
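To see why a slight token-level change can skew the sequence-level distribution, consider that sequence probability is a product of per-token probabilities. The numbers below are an illustrative back-of-the-envelope calculation, not figures from the paper:

```python
# Illustrative arithmetic (assumed values, not from the paper):
# label smoothing that lowers the per-token probability of the top
# token from 0.99 to 0.90 is a small token-level change, but over a
# 20-token sequence the greedy hypothesis's probability collapses.
p_sharp, p_smoothed = 0.99, 0.90   # per-token probability of the top token
length = 20
print(p_sharp ** length)     # ~0.82
print(p_smoothed ** length)  # ~0.12
```

The token-level gap is 0.09, yet the sequence-level probability shrinks by a factor of almost seven, which is the kind of sequence-level skew the abstract describes.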
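The following is a minimal sketch of the two ingredients named in the abstract: a temperature-cooled softmax and MBR decoding over a candidate set. It is an illustration under assumptions, not the authors' implementation; the function names (`cooled_softmax`, `mbr_decode`) and the toy unigram-overlap utility (a stand-in for BLEU or similar metrics) are hypothetical.

```python
import math

def cooled_softmax(logits, temperature=0.5):
    """Softmax with temperature T < 1: dividing logits by T sharpens
    the distribution, i.e. lowers ("cools") its entropy."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)                       # subtract max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    z = sum(exps)
    return [e / z for e in exps]

def overlap(hyp, ref):
    """Toy utility: unigram Jaccard overlap, standing in for BLEU."""
    hs, rs = set(hyp.split()), set(ref.split())
    return len(hs & rs) / max(len(hs | rs), 1)

def mbr_decode(candidates, probs, utility):
    """Return the candidate with the highest expected utility under
    the model distribution (Minimum Bayes Risk decoding)."""
    best, best_score = None, -math.inf
    for hyp in candidates:
        score = sum(p * utility(hyp, ref) for ref, p in zip(candidates, probs))
        if score > best_score:
            best, best_score = hyp, score
    return best

# Toy usage: three candidate translations with assumed model scores.
candidates = ["the cat sat", "the cat sat down", "a dog ran"]
logits = [2.0, 1.5, 0.2]
probs = cooled_softmax(logits, temperature=0.5)  # cooled distribution
print(mbr_decode(candidates, probs, overlap))
```

In this sketch, cooling happens before MBR: the candidate probabilities are computed from temperature-scaled logits, which concentrates mass on high-scoring hypotheses and counteracts the entropy added by label smoothing, in the spirit of the DC-MBR method described above.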