Paraphrase generation is an important NLP task that has achieved significant progress recently. However, one crucial problem has been overlooked: `how to evaluate the quality of a paraphrase?'. Most existing paraphrase generation models adopt reference-based metrics (e.g., BLEU) from neural machine translation (NMT) to evaluate their generated paraphrases. The reliability of such metrics has hardly been examined, and they are plausible only when a standard reference exists. Therefore, this paper first answers a fundamental question: `Are existing metrics reliable for paraphrase generation?'. We present two conclusions that disobey conventional wisdom in paraphrase generation: (1) existing metrics align poorly with human annotation in both system-level and segment-level paraphrase evaluation; (2) reference-free metrics outperform reference-based metrics, indicating that standard references are unnecessary for evaluating paraphrase quality. These empirical findings expose the lack of reliable automatic evaluation metrics. We therefore propose BBScore, a reference-free metric that reflects the quality of generated paraphrases. BBScore consists of two sub-metrics, the S3C score and SelfBLEU, which correspond to the two criteria for paraphrase evaluation: semantic preservation and diversity. By combining the two sub-metrics, BBScore significantly outperforms existing paraphrase evaluation metrics.
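To make the two-criterion design concrete, below is a minimal sketch of a BBScore-style reference-free metric. It assumes semantic preservation can be approximated by the cosine similarity of sentence embeddings (a hypothetical stand-in for the paper's S3C score, whose exact formulation is not given here) and measures diversity as one minus the SelfBLEU overlap between the source and the candidate; the combination weight `alpha` is likewise a hypothetical parameter, not the paper's.

```python
# Hedged sketch of a reference-free paraphrase metric in the spirit of BBScore.
# Assumptions (not from the paper): cosine similarity of sentence embeddings
# stands in for the S3C score, and the two sub-metrics are combined linearly
# with a hypothetical weight `alpha`.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

model = SentenceTransformer("all-MiniLM-L6-v2")  # any sentence encoder works here

def self_bleu(source: str, candidate: str) -> float:
    """Surface n-gram overlap with the source (higher = less diverse)."""
    smooth = SmoothingFunction().method1
    return sentence_bleu([source.split()], candidate.split(),
                         smoothing_function=smooth)

def bb_score(source: str, candidate: str, alpha: float = 0.5) -> float:
    """Combine semantic preservation and diversity into a single score."""
    semantic = float(cos_sim(model.encode(source), model.encode(candidate)))
    diversity = 1.0 - self_bleu(source, candidate)
    return alpha * semantic + (1.0 - alpha) * diversity

# Usage: a good paraphrase keeps the meaning while changing the surface form.
print(bb_score("the cat sat on the mat", "a cat was sitting on the mat"))
```

The key design point this illustrates is that no reference paraphrase appears anywhere: both sub-metrics compare the candidate only against the source sentence, which is what makes the metric reference-free.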