Reliable evaluation protocols are of utmost importance for reproducible NLP research. In this work, we show that sometimes neither automatic metrics nor conventional human evaluation are sufficient to draw conclusions about system performance. Using sentence compression as an example task, we demonstrate how a system can game a well-established dataset to achieve state-of-the-art results. In contrast with previous work that reported correlation between human judgements and metric scores, our manual analysis of state-of-the-art system outputs demonstrates that high metric scores may indicate only a better fit to the data, not better outputs as perceived by humans.