Automatic summary assessment is useful for both machine-generated and human-produced summaries. Automatically evaluating a summary text given its source document enables, for example, summary generation system development and the detection of inappropriate summaries. Summary assessment can be run in a number of modes: ranking summary generation systems; ranking summaries of a particular document; and estimating the quality of a document-summary pair on an absolute scale. Existing datasets with annotation for summary assessment are usually based on news summarization datasets such as CNN/DailyMail or XSum. In this work, we describe a new dataset, the podcast summary assessment corpus, a collection of podcast summaries that were evaluated by human experts at TREC2020. Compared to existing summary assessment data, this dataset has two unique aspects: (i) long-input documents based on spoken podcasts; and (ii) the opportunity to detect inappropriate reference summaries in the podcast corpus. First, we examine existing assessment methods, including model-free and model-based methods, and provide benchmark results for this long-input summary assessment dataset. Second, with the aim of filtering reference summary-document pairings for training, we apply summary assessment for data selection. The experimental results on these two aspects provide interesting insights into the summary assessment and generation tasks. The podcast summary assessment data is available.