Reproducibility is of utmost concern in machine learning and natural language processing (NLP). In the field of natural language generation (especially machine translation), the seminal paper of Post (2018) pointed out reproducibility problems with the then-dominant metric, BLEU. Nowadays, BERT-based evaluation metrics considerably outperform BLEU. In this paper, we ask whether results and claims from four recent BERT-based metrics can be reproduced. We find that reproduction of claims and results often fails because of (i) heavy, undocumented preprocessing involved in the metrics, (ii) missing code, and (iii) weaker reported results for the baseline metrics; (iv) in one case, the problem stems from correlating not with human scores but with a wrong column of the CSV file, inflating the reported scores by 5 points. Motivated by the impact of preprocessing, we then conduct a second study examining its effects more closely (for one of the metrics). We find that preprocessing can have large effects, especially for highly inflected languages. In this case, the effect of preprocessing may be larger than the effect of the aggregation mechanism (e.g., greedy alignment vs. Word Mover's Distance).
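The CSV-column pitfall mentioned in point (iv) can be illustrated with a minimal sketch. All column names and values below are invented for illustration; the point is only that correlating a metric's scores against the wrong column of a results file silently yields a different (and possibly inflated or deflated) correlation than correlating against the human judgments.

```python
# Hypothetical illustration of point (iv): picking the wrong CSV column
# changes the reported correlation. Data and column names are made up.
import io

import pandas as pd
from scipy.stats import pearsonr

# A tiny synthetic results file: one row per translated segment.
csv = io.StringIO(
    "segment_id,human_score,metric_score\n"
    "1,0.2,0.25\n"
    "2,0.5,0.48\n"
    "3,0.9,0.91\n"
    "4,0.4,0.38\n"
)
df = pd.read_csv(csv)

# Correct: correlate the metric with the human judgments.
r_correct, _ = pearsonr(df["metric_score"], df["human_score"])

# Wrong: correlate with an unrelated column (here, the segment id).
r_wrong, _ = pearsonr(df["metric_score"], df["segment_id"])

print(f"correct column: r = {r_correct:.3f}")
print(f"wrong column:   r = {r_wrong:.3f}")
```

On this synthetic data the metric tracks the human scores almost linearly, so the correct correlation is near 1, while the wrong column gives a much lower value; in a real evaluation such a mix-up goes unnoticed unless the pipeline is checked against the data layout.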