A desirable property of a reference-based evaluation metric that measures the content quality of a summary is that it should estimate how much information that summary has in common with a reference. Traditional text-overlap-based metrics such as ROUGE fail to achieve this because they are limited to matching tokens, either lexically or via embeddings. In this work, we propose a metric to evaluate the content quality of a summary using question-answering (QA). QA-based methods directly measure a summary's information overlap with a reference, making them fundamentally different from text-overlap metrics. We demonstrate the experimental benefits of QA-based metrics through an analysis of our proposed metric, QAEval. QAEval outperforms current state-of-the-art metrics on most evaluations using benchmark datasets, while being competitive on others due to limitations of state-of-the-art models. Through a careful analysis of each component of QAEval, we identify its performance bottlenecks and estimate that its potential upper-bound performance surpasses that of all other automatic metrics, approaching that of the gold-standard Pyramid Method.
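To make the contrast with text-overlap metrics concrete, the sketch below shows in schematic form how a QA-based content score can be computed: question-answer pairs derived from the reference are answered against the candidate summary, and the predicted answers are scored with token-level F1. This is a minimal illustration, not the QAEval implementation; the `answer_fn` callable and the pre-generated `qa_pairs` are assumed stand-ins for the learned question-generation and question-answering models that such a metric requires.

```python
from collections import Counter
from typing import Callable, List, Tuple


def token_f1(prediction: str, gold: str) -> float:
    """Token-level F1 between a predicted and a gold answer (SQuAD-style)."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def qa_content_score(
    summary: str,
    qa_pairs: List[Tuple[str, str]],
    answer_fn: Callable[[str, str], str],
) -> float:
    """Average answer-overlap score of a candidate summary.

    qa_pairs:  (question, gold_answer) pairs generated from the reference
               summary (hypothetical output of a question-generation model).
    answer_fn: a QA model that answers a question using only the candidate
               summary as context (hypothetical interface).
    """
    if not qa_pairs:
        return 0.0
    scores = [
        token_f1(answer_fn(question, summary), gold_answer)
        for question, gold_answer in qa_pairs
    ]
    return sum(scores) / len(scores)
```

Because the score is computed over answers to questions about the reference's content, rather than over surface token matches between the two texts, a summary that conveys the same facts in different words can still score highly, which is the property the abstract argues text-overlap metrics lack.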