Despite significant advances in keyphrase extraction and keyphrase generation methods, the predominant evaluation approach relies solely on exact matching against human references and disregards reference-free attributes. This scheme fails to credit systems that generate keyphrases semantically equivalent to the references or keyphrases of practical utility. To better understand the strengths and weaknesses of different keyphrase systems, we propose a comprehensive evaluation framework consisting of six critical dimensions: naturalness, faithfulness, saliency, coverage, diversity, and utility. For each dimension, we discuss the desiderata and design semantic-based metrics aligned with the evaluation objectives. Rigorous meta-evaluation studies demonstrate that our evaluation strategy correlates better with human preferences than a range of previously used metrics. Using this framework, we re-evaluate 18 keyphrase systems and find that (1) the best model differs across dimensions, with pre-trained language models performing best in most; (2) utility in downstream tasks does not always correlate well with reference-based metrics; and (3) large language models exhibit strong performance in reference-free evaluation.
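To illustrate why exact matching undercredits semantically equivalent keyphrases, below is a minimal sketch of a semantic soft-match score based on cosine similarity of phrase embeddings. It uses the sentence-transformers library; the model name and the 0.7 similarity threshold are illustrative assumptions, not the metric design from this paper.

```python
# Minimal sketch: semantic (soft) matching between predicted and reference
# keyphrases. Assumes the sentence-transformers library; the model choice
# and threshold are illustrative, not the paper's actual metrics.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def semantic_f1(predicted, references, threshold=0.7):
    """F1 where a phrase counts as matched if some counterpart is
    semantically close (cosine similarity >= threshold)."""
    if not predicted or not references:
        return 0.0
    pred_emb = model.encode(predicted, normalize_embeddings=True)
    ref_emb = model.encode(references, normalize_embeddings=True)
    sim = pred_emb @ ref_emb.T  # cosine similarity (unit-norm embeddings)
    precision = float(np.mean(sim.max(axis=1) >= threshold))
    recall = float(np.mean(sim.max(axis=0) >= threshold))
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Exact match scores 0 here, while the soft match credits the paraphrase.
print(semantic_f1(["neural networks"], ["deep neural nets"]))
```

Unlike exact-match F1, this kind of embedding-based score degrades gracefully: paraphrases and morphological variants receive partial credit proportional to how often they clear the similarity threshold.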