Text summarization models are often trained to produce summaries that meet human quality requirements. However, existing evaluation metrics for summary text are only rough proxies for summary quality: they correlate poorly with human scoring and inhibit summary diversity. To address these problems, we propose SummScore, a comprehensive metric for summary quality evaluation based on the Cross-Encoder. First, by adopting an original-text-to-summary measurement mode and comparing a summary against the semantics of the original text, SummScore avoids inhibiting summary diversity. With the help of a Cross-Encoder pre-trained on text matching, SummScore can effectively capture subtle semantic differences between summaries. Second, to improve comprehensiveness and interpretability, SummScore consists of four fine-grained submodels that measure Coherence, Consistency, Fluency, and Relevance separately. We use semi-supervised multi-round training to improve the performance of our model on the extremely limited annotated data. Extensive experiments show that SummScore significantly outperforms existing evaluation metrics in correlation with human scoring across all four dimensions. We also provide SummScore's quality evaluation results on 16 mainstream summarization models to support later research.
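To make the scoring scheme concrete, below is a minimal sketch (not the authors' released code) of the original-text-versus-summary measurement mode described above, using the `CrossEncoder` API from the sentence-transformers library. The per-dimension checkpoints are hypothetical placeholders; here a single generic text-matching checkpoint (`cross-encoder/stsb-roberta-base`) stands in for the four fine-grained submodels.

```python
# Sketch of SummScore-style scoring: each quality dimension is scored by
# a Cross-Encoder that reads the (source text, summary) pair jointly.
from sentence_transformers import CrossEncoder

DIMENSIONS = ["coherence", "consistency", "fluency", "relevance"]

# Hypothetical: in the paper each dimension has its own fine-tuned submodel;
# here every dimension reuses one generic text-matching checkpoint.
SUBMODELS = {dim: CrossEncoder("cross-encoder/stsb-roberta-base")
             for dim in DIMENSIONS}

def summ_score(source: str, summary: str) -> dict:
    """Score a summary against its source text on four dimensions."""
    return {dim: float(model.predict([(source, summary)])[0])
            for dim, model in SUBMODELS.items()}

source = "The city council approved the new transit budget on Monday ..."
summary = "The council passed the transit budget."
print(summ_score(source, summary))  # one score per dimension
```

Because the Cross-Encoder attends over the source and the summary together, two semantically equivalent but differently worded summaries can both score well, which is the property the abstract credits for avoiding the inhibition of summary diversity.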