Machine translation is a popular test bed for research in neural sequence-to-sequence models, but despite much recent research, there is still a lack of understanding of these models. Practitioners report performance degradation with large beams, under-estimation of rare words, and a lack of diversity in the final translations. Our study relates some of these issues to the inherent uncertainty of the task, due to the existence of multiple valid translations for a single source sentence, and to the extrinsic uncertainty caused by noisy training data. We propose tools and metrics to assess how uncertainty in the data is captured by the model distribution and how it affects search strategies that generate translations. Our results show that search works remarkably well but that the models tend to spread too much probability mass over the hypothesis space. Next, we propose tools to assess model calibration and show how to easily fix some shortcomings of current models. We release both code and multiple human reference translations for two popular benchmarks.