Canonical automatic summary evaluation metrics, such as ROUGE, suffer from two drawbacks. First, they do not capture semantic similarity and linguistic quality well. Second, they require a reference summary, which is expensive or impossible to obtain in many cases. Existing efforts address these two drawbacks separately and have limitations. To address them holistically, we introduce an end-to-end approach to summary quality assessment that leverages sentence or document embeddings, along with two negative sampling strategies for creating training data for this supervised approach. The proposed approach exhibits promising results on several summarization datasets spanning various domains, including news, legislative bills, scientific papers, and patents. When rating machine-generated summaries on TAC2010, our approach outperforms ROUGE in terms of linguistic quality and achieves a correlation coefficient of up to 0.5702 with human evaluations in terms of modified pyramid scores. We hope our approach can facilitate summarization research and applications when reference summaries are infeasible or costly to obtain, or when linguistic quality is a focus.
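As a rough illustration of the negative-sampling idea, and not the exact procedure used in this work, the sketch below pairs each document with its own summary as a positive example and with a corrupted or mismatched summary as a negative example, then trains a simple classifier on embedding-based features. The `embed` function is a hypothetical stand-in for any sentence or document encoder, and the two corruption strategies shown are illustrative assumptions.

```python
# Minimal sketch of negative sampling for a reference-free, supervised
# summary-quality scorer. Assumptions: `embed` is a placeholder for a real
# sentence/document encoder; the two negative-sampling strategies here
# (mismatched summaries, shuffled sentences) are illustrative only.
import random
import numpy as np
from sklearn.linear_model import LogisticRegression

def embed(text: str) -> np.ndarray:
    """Hypothetical encoder: replace with a real sentence/document embedding model."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal(64)

def features(document: str, summary: str) -> np.ndarray:
    """Concatenate document and summary embeddings plus their absolute difference."""
    d, s = embed(document), embed(summary)
    return np.concatenate([d, s, np.abs(d - s)])

def make_training_data(docs, summaries):
    X, y = [], []
    for i, (doc, summ) in enumerate(zip(docs, summaries)):
        X.append(features(doc, summ)); y.append(1)              # positive pair
        # Negative strategy 1: pair the document with another document's summary.
        j = random.choice([k for k in range(len(docs)) if k != i])
        X.append(features(doc, summaries[j])); y.append(0)
        # Negative strategy 2: shuffle the summary's sentences to degrade coherence.
        sents = summ.split(". ")
        random.shuffle(sents)
        X.append(features(doc, ". ".join(sents))); y.append(0)
    return np.array(X), np.array(y)

docs = ["The bill amends tax law. It raises rates on capital gains.",
        "The study measures protein folding. Results show faster convergence."]
summaries = ["The bill raises capital gains rates. It amends existing tax law.",
             "Protein folding converges faster. The study reports the measurements."]

X, y = make_training_data(docs, summaries)
scorer = LogisticRegression(max_iter=1000).fit(X, y)
# Score a (document, summary) pair without any reference summary.
print(scorer.predict_proba(features(docs[0], summaries[0]).reshape(1, -1))[0, 1])
```

In such a setup, the classifier's positive-class probability can serve as a reference-free quality score for a new (document, summary) pair.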