This paper discusses two existing approaches to analyzing the correlation between automatic evaluation metrics and human scores in natural language generation. Our experiments show that the correlation between automatic scores and human judgments is inconsistent: the results differ depending on whether the analysis is carried out at the system level or at the sentence level.
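To illustrate how the two levels of analysis can disagree, the following minimal sketch computes both correlations on a small set of made-up scores. It assumes one common reading of the two approaches, which may differ from the exact definitions used in the experiments: the system-level analysis averages the scores of each system and correlates the averages, while the sentence-level analysis correlates the scores of all individual outputs. The data, the choice of Pearson's coefficient, and the function names `system_level` and `sentence_level` are illustrative assumptions.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical toy data (not from the paper):
# system name -> list of (automatic metric score, human score), one pair per sentence.
scores = {
    "system_a": [(0.1, 3), (0.2, 2), (0.3, 1)],
    "system_b": [(0.4, 4), (0.5, 3), (0.6, 2)],
    "system_c": [(0.7, 5), (0.8, 4), (0.9, 3)],
}


def system_level(scores):
    """Average the scores of each system, then correlate the per-system means."""
    metric_means = [sum(m for m, _ in pairs) / len(pairs) for pairs in scores.values()]
    human_means = [sum(h for _, h in pairs) / len(pairs) for pairs in scores.values()]
    return pearson(metric_means, human_means)


def sentence_level(scores):
    """Pool all sentences from all systems and correlate the individual scores."""
    pooled = [pair for pairs in scores.values() for pair in pairs]
    return pearson([m for m, _ in pooled], [h for _, h in pooled])


system_r = system_level(scores)
sentence_r = sentence_level(scores)
print(f"system-level r   = {system_r:.3f}")
print(f"sentence-level r = {sentence_r:.3f}")
```

On this toy data the per-system means are perfectly linearly related, so the system-level correlation is close to 1, whereas within each system the metric and the human scores move in opposite directions, which pulls the pooled sentence-level correlation down to roughly 0.45. The same metric therefore looks far more reliable under one analysis than under the other.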