Unsupervised approaches to extractive summarization usually rely on a notion of sentence importance defined by the semantic similarity between a sentence and the document. We propose new metrics of relevance and redundancy using pointwise mutual information (PMI) between sentences, which can be easily computed by a pre-trained language model. Intuitively, a relevant sentence allows readers to infer the document content (high PMI with the document), and a redundant sentence can be inferred from the summary (high PMI with the summary). We then develop a greedy sentence selection algorithm to maximize relevance and minimize redundancy of extracted sentences. We show that our method outperforms similarity-based methods on datasets in a range of domains including news, medical journal articles, and personal anecdotes.
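The greedy selection described above can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the paper computes PMI between sentences with a pre-trained language model, whereas `pmi_proxy` below is a hypothetical word-overlap stand-in used only so the selection loop is runnable. The trade-off weight `lam` is also an assumption, not a parameter from the paper.

```python
def pmi_proxy(a: str, b: str) -> float:
    # Hypothetical stand-in for LM-based sentence PMI: normalized word
    # overlap. The actual method scores PMI with a pre-trained language
    # model; this proxy only exercises the greedy selection logic.
    wa, wb = set(a.lower().split()), set(b.lower().split())
    if not wa or not wb:
        return 0.0
    return len(wa & wb) / (len(wa) * len(wb)) ** 0.5


def greedy_select(sentences, document, k, lam=1.0):
    # Greedily pick k sentences, maximizing relevance to the document
    # while penalizing redundancy with sentences already selected.
    summary, candidates = [], list(sentences)
    while candidates and len(summary) < k:
        def score(s):
            relevance = pmi_proxy(s, document)          # high PMI with the document
            redundancy = max((pmi_proxy(s, t) for t in summary),
                             default=0.0)               # high PMI with the summary
            return relevance - lam * redundancy
        best = max(candidates, key=score)
        summary.append(best)
        candidates.remove(best)
    return summary


if __name__ == "__main__":
    sents = [
        "the cat sat on the mat",
        "dogs bark loudly at night",
        "the cat is on the mat again",
        "stock prices rose today",
    ]
    doc = " ".join(sents)
    print(greedy_select(sents, doc, k=2))
```

Because the redundancy term subtracts the best PMI against any already-selected sentence, a near-duplicate of an earlier pick is penalized even when it is highly relevant on its own.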