Canonical automatic summary evaluation metrics, such as ROUGE, focus on lexical similarity, which captures neither semantics nor linguistic quality well, and they require a reference summary that is costly to obtain. Recently, a growing number of efforts have sought to alleviate one or both of these drawbacks. In this paper, we present a proof-of-concept study of a weakly supervised summary evaluation approach that operates without reference summaries. Massive data in existing summarization datasets are transformed for training by pairing documents with corrupted reference summaries. In cross-domain tests, our strategy outperforms baselines with promising improvements, and shows a clear advantage over all compared metrics in gauging linguistic quality.
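The training-data construction described above (pairing each document with its reference summary and with degraded copies of that reference) can be sketched as follows. This is a minimal illustration, not the paper's actual pipeline: the specific corruption operations here (random word deletion and local shuffling) and all function names are assumptions for illustration.

```python
import random

def corrupt_summary(summary: str, rng: random.Random, p_drop: float = 0.15) -> str:
    """Degrade a reference summary with hypothetical corruption operations:
    random word deletion (hurts informativeness) and local word shuffling
    (hurts fluency)."""
    words = summary.split()
    # Randomly drop words; keep at least one word so the output is non-empty.
    kept = [w for w in words if rng.random() > p_drop] or words[:1]
    # Shuffle a small local window to damage word order.
    i = rng.randrange(len(kept))
    window = kept[i:i + 4]
    rng.shuffle(window)
    kept[i:i + 4] = window
    return " ".join(kept)

def make_training_pairs(doc: str, reference: str, n_neg: int = 3, seed: int = 0):
    """Turn one (document, reference) example into weakly labeled pairs:
    the intact reference gets label 1, corrupted copies get label 0."""
    rng = random.Random(seed)
    pairs = [(doc, reference, 1)]
    pairs += [(doc, corrupt_summary(reference, rng), 0) for _ in range(n_neg)]
    return pairs
```

A scorer trained on such pairs learns to rank an intact summary above its corrupted variants for the same document, which is what allows evaluation without a human-written reference at test time.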