Paraphrase generation is an important NLP task that has achieved significant progress recently. However, one crucial problem has been overlooked: `how to evaluate the quality of a paraphrase?'. Most existing paraphrase generation models adopt reference-based metrics (e.g., BLEU) from neural machine translation (NMT) to evaluate their generated paraphrases. The reliability of such metrics has hardly been examined, and they are plausible only when a standard reference exists. Therefore, this paper first answers a fundamental question: `Are existing metrics reliable for paraphrase generation?'. We present two conclusions that disobey conventional wisdom in paraphrase generation: (1) existing metrics align poorly with human annotation in both system-level and segment-level paraphrase evaluation; (2) reference-free metrics outperform reference-based metrics, indicating that standard references are unnecessary for evaluating paraphrase quality. These empirical findings expose the lack of reliable automatic evaluation metrics. We therefore propose BBScore, a reference-free metric that reflects the quality of generated paraphrases. BBScore consists of two sub-metrics, the S3C score and SelfBLEU, which correspond to the two criteria for paraphrase evaluation: semantic preservation and diversity. By combining the two sub-metrics, BBScore significantly outperforms existing paraphrase evaluation metrics.
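To make the two-criterion design concrete, below is a minimal sketch of a BBScore-style reference-free metric. It assumes semantic preservation can be approximated by the cosine similarity of sentence embeddings (a hypothetical stand-in for the paper's S3C score, whose exact formulation is not given here) and measures diversity as one minus the SelfBLEU overlap between the source and the candidate; the combination weight `alpha` is likewise a hypothetical parameter, not the paper's.

```python
# Hedged sketch of a reference-free paraphrase metric in the spirit of BBScore.
# Assumptions (not from the paper): cosine similarity of sentence embeddings
# stands in for the S3C score, and the two sub-metrics are combined linearly
# with a hypothetical weight `alpha`.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

model = SentenceTransformer("all-MiniLM-L6-v2")  # any sentence encoder works here

def self_bleu(source: str, candidate: str) -> float:
    """Surface n-gram overlap with the source (higher = less diverse)."""
    smooth = SmoothingFunction().method1
    return sentence_bleu([source.split()], candidate.split(),
                         smoothing_function=smooth)

def bb_score(source: str, candidate: str, alpha: float = 0.5) -> float:
    """Combine semantic preservation and diversity into a single score."""
    semantic = float(cos_sim(model.encode(source), model.encode(candidate)))
    diversity = 1.0 - self_bleu(source, candidate)
    return alpha * semantic + (1.0 - alpha) * diversity

# Usage: a good paraphrase keeps the meaning while changing the surface form.
print(bb_score("the cat sat on the mat", "a cat was sitting on the mat"))
```

The key design point this illustrates is that no reference paraphrase appears anywhere: both sub-metrics compare the candidate only against the source sentence, which is what makes the metric reference-free.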