The majority of available text summarization datasets include short-form source documents that lack long-range causal and temporal dependencies, and often contain strong layout and stylistic biases. While relevant, such datasets will offer limited challenges for future generations of text summarization systems. We address these issues by introducing BookSum, a collection of datasets for long-form narrative summarization. Our dataset covers source documents from the literature domain, such as novels, plays, and stories, and includes highly abstractive, human-written summaries at three levels of granularity of increasing difficulty: paragraph, chapter, and book level. The domain and structure of our dataset pose a unique set of challenges for summarization systems, including processing very long documents, non-trivial causal and temporal dependencies, and rich discourse structures. To facilitate future work, we trained and evaluated multiple extractive and abstractive summarization models as baselines for our dataset.