Automatic evaluation metrics capable of replacing human judgments are critical to allowing fast development of new methods. Thus, numerous research efforts have focused on crafting such metrics. In this work, we take a step back and analyze recent progress by comparing the existing body of automatic metrics with human metrics. Since metrics are ultimately used to rank systems, we compare metrics in the space of system rankings. Our extensive statistical analysis reveals surprising findings: automatic metrics -- old and new -- are much more similar to each other than to humans. Automatic metrics are not complementary and rank systems similarly. Strikingly, human metrics predict each other much better than the combination of all automatic metrics predicts any human metric. This is surprising because human metrics are often designed to be independent and to capture different aspects of quality, e.g., content fidelity or readability. We discuss these findings and provide recommendations for future work in the field of evaluation.
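To make the "space of system rankings" concrete, below is a minimal illustrative sketch (not the paper's code) of how one might compare metrics through the rankings they induce over systems, using pairwise Kendall's tau. The metric names and system-level scores are hypothetical placeholders.

```python
# Illustrative sketch: compare metrics by the agreement of the system
# rankings they induce, measured with Kendall's tau.
from itertools import combinations

import numpy as np
from scipy.stats import kendalltau

# Hypothetical system-level scores: each metric assigns one score per system.
scores = {
    "ROUGE-L":        np.array([0.41, 0.38, 0.45, 0.40, 0.43]),
    "BERTScore":      np.array([0.87, 0.85, 0.89, 0.86, 0.88]),
    "Human-Fluency":  np.array([3.9, 4.1, 4.4, 3.7, 4.2]),
    "Human-Fidelity": np.array([3.5, 4.0, 4.3, 3.6, 4.1]),
}

# A metric is used through the ranking it induces over systems, so two
# metrics are "similar" here if those rankings agree (tau close to +1).
for (name_a, a), (name_b, b) in combinations(scores.items(), 2):
    tau, p_value = kendalltau(a, b)
    print(f"{name_a:>14} vs {name_b:<14} tau={tau:+.2f} (p={p_value:.2f})")
```

Under this view, the abstract's claim is that tau between pairs of automatic metrics tends to be higher than tau between any automatic metric and a human metric.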