Checkpoint averaging is a simple and effective method to boost the performance of converged neural machine translation models. The calculation is cheap to perform, and because the translation improvement comes almost for free, the method is widely adopted in neural machine translation research. Despite its popularity, the method itself simply takes the mean of the model parameters from several checkpoints, the selection of which is mostly based on empirical recipes with little justification. In this work, we revisit the concept of checkpoint averaging and consider several extensions. Specifically, we experiment with ideas such as using different checkpoint selection strategies, calculating a weighted average instead of a simple mean, making use of gradient information, and fine-tuning the interpolation weights on development data. Our results confirm the necessity of applying checkpoint averaging for optimal performance, but also suggest that the loss landscape between the converged checkpoints is rather flat, and that little further improvement over simple averaging is to be obtained.
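To make the basic operation concrete, the following is a minimal sketch of parameter averaging over saved checkpoints, assuming PyTorch state_dicts on disk; the function name, file paths, and the optional weights argument (which covers the weighted-average variant) are illustrative, not the paper's implementation.

```python
import torch

def average_checkpoints(paths, weights=None):
    """Return a state_dict whose parameters are the (weighted) mean of the
    parameters stored in the given checkpoint files."""
    if weights is None:
        # Simple mean: uniform interpolation weights summing to 1.
        weights = [1.0 / len(paths)] * len(paths)
    averaged = {}
    for path, w in zip(paths, weights):
        state = torch.load(path, map_location="cpu")
        for name, tensor in state.items():
            if not torch.is_floating_point(tensor):
                # Copy integer buffers (e.g., step counters) from the first
                # checkpoint instead of averaging them.
                averaged.setdefault(name, tensor)
            elif name not in averaged:
                averaged[name] = w * tensor.clone()
            else:
                averaged[name] += w * tensor
    return averaged

# Hypothetical usage: average the last few checkpoints of a converged run,
# then load the result into the model before decoding.
# state = average_checkpoints(["ckpt_18.pt", "ckpt_19.pt", "ckpt_20.pt"])
# model.load_state_dict(state)
```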