We address the problem of unsupervised extractive document summarization, with a focus on long documents. We model the task as a sparse auto-regression problem and approximate the resulting combinatorial problem with a convex, norm-constrained one, which we solve using a dedicated Frank-Wolfe algorithm. To generate a summary with $k$ sentences, the algorithm needs only $\approx k$ iterations, making it very efficient. We explain how to avoid explicit calculation of the full gradient and how to include sentence embedding information. We evaluate our approach against two other unsupervised methods using both lexical (standard) ROUGE scores and semantic (embedding-based) ones. Our method achieves better results on both datasets and works especially well when combined with embeddings for highly paraphrased summaries.
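To make the link between Frank-Wolfe iterations and summary length concrete, here is a minimal sketch under stated assumptions. It is not the paper's algorithm: the objective $\tfrac{1}{2}\|D - DX\|_F^2$ with a row-wise group-norm constraint $\sum_i \|X_{i,:}\|_2 \le \tau$, the step size $2/(t+2)$, and the names `frank_wolfe_summary`, `docs`, and `tau` are all illustrative choices. The sketch also computes the full gradient at every step, which the abstract says the actual method avoids. The point it illustrates is that each Frank-Wolfe step activates at most one row of $X$, so $k$ steps select at most $k$ sentences.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def frank_wolfe_summary(docs, k, tau=1.0):
    """Select up to k sentence indices via Frank-Wolfe (illustrative sketch).

    docs: list of sentence vectors (lists of floats), one per sentence.
    Assumed objective: 0.5 * ||D - D X||_F^2 subject to
    sum_i ||X[i, :]||_2 <= tau, where column j of D is docs[j].
    """
    n = len(docs)
    # Gram matrix K = D^T D, so the gradient can be written as -K (I - X).
    K = [[dot(docs[i], docs[j]) for j in range(n)] for i in range(n)]
    # X[i][j]: weight of sentence i when reconstructing sentence j.
    X = [[0.0] * n for _ in range(n)]
    selected = []
    for t in range(k):
        # Residual coefficients R = I - X.
        R = [[(1.0 if i == j else 0.0) - X[i][j] for j in range(n)]
             for i in range(n)]
        # Full gradient G = -K R (computed explicitly here for clarity only).
        G = [[-sum(K[i][l] * R[l][j] for l in range(n)) for j in range(n)]
             for i in range(n)]
        # Linear minimization oracle over the group-norm ball:
        # the minimizing vertex is supported on the row of G with largest l2 norm.
        norms = [math.sqrt(dot(g, g)) for g in G]
        i_star = max(range(n), key=lambda i: norms[i])
        if norms[i_star] == 0.0:
            break  # gradient vanished: nothing left to explain
        if i_star not in selected:
            selected.append(i_star)
        # Convex combination with the vertex S = -tau * e_{i*} g_{i*}^T / ||g_{i*}||.
        gamma = 2.0 / (t + 2.0)
        for i in range(n):
            for j in range(n):
                s = -tau * G[i][j] / norms[i_star] if i == i_star else 0.0
                X[i][j] = (1.0 - gamma) * X[i][j] + gamma * s
    return sorted(selected)


# Toy input: three "sentences" as 2-d vectors (made-up values).
toy_docs = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0]]
summary_indices = frank_wolfe_summary(toy_docs, k=2)
```

Because the iterate is a convex combination of at most one new single-row vertex per step, the number of non-zero rows of `X` can never exceed the number of iterations. This is one plausible reading of the abstract's "$\approx k$ iterations" claim, not a statement about the authors' implementation.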