Text clustering methods were traditionally incorporated into multi-document summarization (MDS) as a means of coping with considerable information repetition. Clusters were leveraged to indicate information saliency and to avoid redundancy. These methods focused on clustering sentences, even though closely related sentences usually also contain non-aligning information. In this work, we revisit the clustering approach, grouping together propositions for more precise information alignment. Specifically, our method detects salient propositions, groups them into paraphrastic clusters, and generates a representative sentence for each cluster by fusing its propositions. Our summarization method improves over the previous state-of-the-art MDS method on the DUC 2004 and TAC 2011 datasets, in both automatic ROUGE scores and human preference.
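The cluster-then-represent flow described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not say which models perform salience detection, clustering, or fusion, so the code below assumes the propositions are already extracted, swaps in token-overlap (Jaccard) similarity for paraphrase detection, uses cluster size as a saliency signal, and replaces fusion with simply picking the longest member of each cluster. All function names and the `threshold` parameter are hypothetical.

```python
import re


def tokens(text):
    """Lowercased word set; a crude stand-in for a learned paraphrase model."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Token-overlap similarity between two propositions (assumed metric)."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def cluster_propositions(props, threshold=0.5):
    """Greedy single-pass clustering against each cluster's first member."""
    clusters = []
    for p in props:
        for c in clusters:
            if jaccard(p, c[0]) >= threshold:
                c.append(p)
                break
        else:
            # No existing cluster is similar enough: start a new one.
            clusters.append([p])
    return clusters


def summarize(props, threshold=0.5, max_sentences=2):
    """Rank clusters by size and emit one representative per cluster."""
    clusters = cluster_propositions(props, threshold)
    # Larger clusters mean more repetition across documents, used here
    # as a simple saliency signal.
    clusters.sort(key=len, reverse=True)
    # Placeholder for fusion: return the longest proposition in the cluster.
    return [max(c, key=len) for c in clusters[:max_sentences]]


if __name__ == "__main__":
    props = [
        "The storm hit the coast on Monday",
        "On Monday the storm hit the coast",
        "Officials ordered an evacuation",
        "The storm hit the coast Monday morning",
    ]
    for sentence in summarize(props):
        print(sentence)
```

A real system would replace each placeholder with a stronger component, for example a trained salience classifier, embedding-based clustering, and a generative fusion model. The sketch only shows how the three stages connect.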