Widely used evaluation metrics for text generation either do not work well with longer texts or fail to evaluate all aspects of text quality. In this paper, we introduce a new metric called SMART to mitigate such limitations. Specifically, we treat sentences, rather than tokens, as the basic units of matching, and use a sentence matching function to soft-match candidate and reference sentences. Candidate sentences are also compared to sentences in the source documents to allow grounding (e.g., factuality) evaluation. Our results show that the system-level correlations of our proposed metric with a model-based matching function outperform all competing metrics on the SummEval summarization meta-evaluation dataset, while the same metric with a string-based matching function is competitive with current model-based metrics. The latter does not use any neural model, which makes it useful during model development, when resources may be limited and fast evaluation is required. Finally, we also conduct extensive analyses showing that our proposed metrics work well with longer summaries and are less biased towards specific models.
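To make the sentence-level matching idea concrete, the following is a minimal sketch, assuming a greedy max-alignment with precision/recall/F-measure aggregation and a toy token-overlap matcher; these choices stand in for the paper's string-based and model-based matching functions and are not the exact SMART formulation. Grounding against source documents could be sketched the same way by passing source sentences in place of reference sentences.

```python
from typing import Callable, List


def smart_score(
    candidate_sents: List[str],
    reference_sents: List[str],
    match_fn: Callable[[str, str], float],
) -> dict:
    """Sentence-level soft matching between a candidate and a reference.

    Each candidate sentence is greedily aligned to its best-matching
    reference sentence (precision direction), and each reference sentence
    to its best-matching candidate sentence (recall direction). The greedy
    max and F-measure aggregation here are illustrative assumptions.
    """
    precision = sum(
        max(match_fn(c, r) for r in reference_sents) for c in candidate_sents
    ) / len(candidate_sents)
    recall = sum(
        max(match_fn(r, c) for c in candidate_sents) for r in reference_sents
    ) / len(reference_sents)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


def token_overlap(a: str, b: str) -> float:
    """A toy string-based matcher: fraction of a's tokens that appear in b."""
    a_toks, b_toks = a.lower().split(), set(b.lower().split())
    return sum(t in b_toks for t in a_toks) / len(a_toks) if a_toks else 0.0


# Example usage with hypothetical sentences.
candidate = ["The cat sat on the mat.", "It looked happy."]
reference = ["A cat was sitting on a mat.", "The cat seemed content."]
print(smart_score(candidate, reference, token_overlap))
```

In practice, `match_fn` would be swapped for the metric's string-based or model-based sentence matching function; the sketch only illustrates how sentence-level soft matching replaces token-level matching.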