Human evaluation is the foundation upon which the evaluation of both summarization systems and automatic metrics rests. However, existing human evaluation protocols and benchmarks for summarization either exhibit low inter-annotator agreement or lack the scale needed to draw statistically significant conclusions, and an in-depth analysis of human evaluation is lacking. In this work, we address the shortcomings of existing summarization evaluation along the following axes: 1) We propose a modified summarization salience protocol, Atomic Content Units (ACUs), which relies on fine-grained semantic units and allows for high inter-annotator agreement. 2) We curate the Robust Summarization Evaluation (RoSE) benchmark, a large human evaluation dataset consisting of over 22k summary-level annotations of state-of-the-art systems on three datasets. 3) We compare our ACU protocol with three other human evaluation protocols, underscoring potential confounding factors in evaluation setups. 4) We evaluate existing automatic metrics using the collected human annotations across evaluation protocols and demonstrate how our benchmark leads to more statistically stable and significant results. Furthermore, our findings have important implications for evaluating large language models (LLMs): we show that LLMs tuned with human feedback (e.g., GPT-3.5) may overfit unconstrained human evaluation, which is affected by annotators' prior, input-agnostic preferences, calling for more robust, targeted evaluation methods.
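To illustrate how fine-grained ACU judgments can be turned into summary- and system-level scores, the following minimal Python sketch aggregates per-ACU yes/no matching decisions into a fraction-matched score and averages it over a test set. The function names (acu_score, system_score) and the example judgments are hypothetical illustrations, not part of the RoSE benchmark release.

```python
from statistics import mean

def acu_score(acu_matches: list[bool]) -> float:
    """Summary-level score: fraction of reference ACUs judged present in the summary."""
    if not acu_matches:
        raise ValueError("at least one ACU judgment is required")
    return sum(acu_matches) / len(acu_matches)

def system_score(summary_judgments: list[list[bool]]) -> float:
    """System-level score: mean of summary-level ACU scores over the test set."""
    return mean(acu_score(judgments) for judgments in summary_judgments)

# Hypothetical annotations: each inner list holds one aggregated
# yes/no judgment per ACU for a single system summary.
judgments = [
    [True, True, False, True],   # 3 of 4 ACUs matched -> 0.75
    [False, True, True],         # 2 of 3 ACUs matched -> ~0.67
]
print(round(system_score(judgments), 3))
```

Because each judgment concerns a single, fine-grained semantic unit rather than an overall quality impression, disagreements between annotators tend to be localized, which is what allows the protocol to reach high inter-annotator agreement.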