Current practices in metric evaluation focus on a single dataset, e.g., the Newstest dataset in each year's WMT Metrics Shared Task. However, in this paper, we show both qualitatively and quantitatively that the performance of metrics is sensitive to the data: the ranking of metrics varies when the evaluation is conducted on different datasets. This paper then investigates two potential hypotheses that may account for this data variance, namely the presence of insignificant data points and the deviation from the Independent and Identically Distributed (i.i.d.) assumption. In conclusion, our findings suggest that when evaluating automatic translation metrics, researchers should take data variance into account and be cautious about claiming results based on a single dataset, because such results may be inconsistent with those on most other datasets.
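To make the data-variance claim concrete, below is a minimal sketch of the kind of per-dataset comparison the abstract describes: metrics are ranked by their Pearson correlation with human judgments on each dataset separately, and the rankings are then compared across datasets. This is an illustration, not the paper's actual evaluation pipeline; the metric names, dataset names, and scores are hypothetical placeholders, and SciPy is assumed to be available.

# Rank metrics by correlation with human judgments, per dataset.
from scipy.stats import pearsonr

def rank_metrics(metric_scores, human_scores):
    """Return metric names sorted by Pearson correlation with human scores."""
    corrs = {name: pearsonr(scores, human_scores)[0]
             for name, scores in metric_scores.items()}
    return sorted(corrs, key=corrs.get, reverse=True)

# Hypothetical segment-level scores on two datasets.
datasets = {
    "dataset_a": {
        "human": [0.9, 0.4, 0.7, 0.2],
        "metrics": {
            "metric_x": [0.9, 0.4, 0.7, 0.2],  # tracks human scores closely here
            "metric_y": [0.5, 0.4, 0.9, 0.1],
        },
    },
    "dataset_b": {
        "human": [0.8, 0.3, 0.6, 0.5],
        "metrics": {
            "metric_x": [0.3, 0.8, 0.5, 0.6],  # diverges from human scores here
            "metric_y": [0.8, 0.3, 0.6, 0.5],
        },
    },
}

for name, data in datasets.items():
    print(name, rank_metrics(data["metrics"], data["human"]))
# If the printed orders differ, the metric ranking is data-dependent,
# which is exactly the data-variance issue raised above.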