While human evaluation remains best practice for accurately judging the faithfulness of automatically-generated summaries, few solutions exist to address the increased difficulty and workload when evaluating long-form summaries. Through a survey of 162 papers on long-form summarization, we first shed light on current human evaluation practices surrounding long-form summaries. We find that 73% of these papers do not perform any human evaluation on model-generated summaries, while other works face new difficulties that manifest when dealing with long documents (e.g., low inter-annotator agreement). Motivated by our survey, we present LongEval, a set of guidelines for human evaluation of faithfulness in long-form summaries that addresses the following challenges: (1) How can we achieve high inter-annotator agreement on faithfulness scores? (2) How can we minimize annotator workload while maintaining accurate faithfulness scores? and (3) Do humans benefit from automated alignment between summary and source snippets? We deploy LongEval in annotation studies on two long-form summarization datasets in different domains (SQuALITY and PubMed), and we find that switching to a finer granularity of judgment (e.g., clause-level) reduces inter-annotator variance in faithfulness scores (e.g., std-dev from 18.5 to 6.8). We also show that scores from a partial annotation of fine-grained units highly correlate with scores from a full annotation workload (0.89 Kendall's tau using 50% of judgments). We release our human judgments, annotation templates, and our software as a Python library for future research.