Automatic document summarization aims to produce a concise summary that covers the salient information in an input document. In a report, this salient information can be scattered across both textual and non-textual content. However, existing document summarization datasets and methods usually focus on the text and filter out the non-textual content. Omitting tabular data can limit the informativeness of the produced summaries, especially when a summary must include quantitative descriptions of critical metrics that appear in tables. Existing datasets and methods cannot meet the requirements of summarizing the long text and multiple tables in each report. To address this scarcity of data, we propose FINDSum, the first large-scale dataset for long text and multi-table summarization. Built on 21,125 annual reports from 3,794 companies, it has two subsets, one for summarizing each company's results of operations and one for its liquidity. To summarize the long text and dozens of tables in each report, we present three types of summarization methods. We also propose a set of evaluation metrics to assess the usage of numerical information in produced summaries. Dataset analyses and experimental results indicate the importance of jointly considering input textual and tabular data when summarizing report documents.
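To make concrete what assessing the usage of numerical information in a summary might involve, the following is a minimal sketch in Python. It is not the metric set proposed in this work, whose definitions are not given here. The functions `extract_numbers`, `number_support`, and `number_recall`, the regular expression, and the sample strings are all hypothetical and chosen only for illustration. The sketch assumes that numbers can be compared by value after stripping thousands separators, and that table cells can be flattened into plain strings.

```python
import re

# Hypothetical pattern: an optional sign, digits with optional thousands
# separators, and an optional decimal part. The lookbehind skips digits
# embedded in identifiers such as "Q4".
NUMBER_RE = re.compile(r"(?<![\w.])-?\d[\d,]*(?:\.\d+)?")


def extract_numbers(text):
    """Return the set of numeric values mentioned in `text`, as floats."""
    values = set()
    for match in NUMBER_RE.finditer(text):
        token = match.group().replace(",", "")
        values.add(float(token))
    return values


def number_support(summary, source):
    """Fraction of numbers in the summary that also occur in the source."""
    summary_nums = extract_numbers(summary)
    if not summary_nums:
        return 0.0
    return len(summary_nums & extract_numbers(source)) / len(summary_nums)


def number_recall(summary, reference):
    """Fraction of numbers in the reference summary that the summary reproduces."""
    reference_nums = extract_numbers(reference)
    if not reference_nums:
        return 0.0
    return len(extract_numbers(summary) & reference_nums) / len(reference_nums)


# Illustrative inputs only; none of these figures come from the dataset.
source_text = "Revenue rose to 1,250 million in 2021 from 1,100 million."
table_cells = ["Net income", "310", "275"]
summary = "Revenue reached 1,250 million and net income was 310 million."
reference = "Revenue grew to 1,250 million; net income rose to 310 million from 275 million."

# Treat the flattened table cells as part of the source.
text_and_tables = source_text + " " + " ".join(table_cells)

support_text_only = number_support(summary, source_text)
support_with_tables = number_support(summary, text_and_tables)
recall = number_recall(summary, reference)
```

In this toy case, one of the two numbers in the summary appears only in the table cells, so the support score is lower when the source is restricted to text. This mirrors the motivation for considering textual and tabular inputs jointly, though a real metric would also need to handle units, scaling, and rounding.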