Model-based, reference-free evaluation metrics have been proposed as a fast and cost-effective approach to evaluate Natural Language Generation (NLG) systems. Despite promising recent results, we find evidence that reference-free evaluation metrics of summarization and dialog generation may be relying on spurious correlations with measures such as word overlap, perplexity, and length. We further observe that for text summarization, these metrics have high error rates when ranking current state-of-the-art abstractive summarization systems. We demonstrate that these errors can be mitigated by explicitly designing evaluation metrics to avoid spurious features in reference-free evaluation.