Manual evaluation is essential to judge progress on automatic text summarization. However, our survey of recent summarization system papers reveals little agreement on how to perform such evaluation studies. We conduct two evaluation experiments on two aspects of summaries' linguistic quality (coherence and repetitiveness) to compare Likert-type and ranking annotations, and show that the best choice of evaluation method can vary from one aspect to another. In our survey, we also find that study parameters such as the total number of annotators and the assignment of annotators to annotation items are often not fully reported, and that subsequent statistical analysis ignores the grouping factors that arise when one annotator judges multiple summaries. Using our evaluation experiments, we show that the total number of annotators can have a strong impact on study power and that current statistical analysis methods can inflate type I error rates up to eight-fold. In addition, we highlight that, for the purpose of system comparison, the current practice of eliciting multiple judgements per summary leads to less powerful and less reliable annotations given a fixed study budget.
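To illustrate the grouping issue mentioned above, the following is a minimal sketch (not the authors' code or data) contrasting a naive analysis that treats every rating as independent with a mixed-effects model that adds a per-annotator random intercept. The column names (`annotator`, `system`, `score`), the synthetic Likert-style data, and the use of statsmodels are assumptions made for illustration only.

```python
# Hypothetical illustration: naive vs. annotator-aware analysis of Likert ratings.
# All data below are synthetic; this does not reproduce the paper's experiments.
import numpy as np
import pandas as pd
from scipy import stats
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)

# 10 annotators, each rating 10 summaries from system A and 10 from system B.
# There is no true system difference; annotators differ in baseline strictness.
annotators = np.repeat(np.arange(10), 20)
system = np.tile(np.repeat(["A", "B"], 10), 10)
annotator_bias = rng.normal(0, 0.8, size=10)[annotators]
score = 3.0 + annotator_bias + rng.normal(0, 1.0, size=200)
df = pd.DataFrame({"annotator": annotators, "system": system, "score": score})

# Naive analysis: treats all 200 ratings as independent observations.
_, p_naive = stats.ttest_ind(df.loc[df.system == "A", "score"],
                             df.loc[df.system == "B", "score"])

# Mixed-effects analysis: a random intercept per annotator models the grouping
# induced by one annotator judging multiple summaries.
result = smf.mixedlm("score ~ system", df, groups=df["annotator"]).fit()
p_mixed = result.pvalues["system[T.B]"]

print(f"naive t-test p = {p_naive:.3f}, mixed-effects p = {p_mixed:.3f}")
```

The sketch only shows how a grouping factor can be declared in the analysis; it does not by itself demonstrate the eight-fold type I error inflation reported in the paper.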