CNewSum: A Large-scale Chinese News Summarization Dataset with Human-annotated Adequacy and Deducibility Level

Automatic text summarization aims to produce a brief summary that preserves the crucial content of the input documents. Both extractive and abstractive methods have achieved great success on English datasets in recent years. However, text summarization in Chinese has seen minimal exploration, limited by the lack of large-scale datasets. In this paper, we present CNewSum, a large-scale Chinese news summarization dataset consisting of 304,307 documents and human-written summaries for the news feed. It has long documents with highly abstractive summaries, which can encourage document-level understanding and generation in current summarization models. An additional distinguishing feature of CNewSum is that its test set contains adequacy and deducibility annotations for the summaries. The adequacy level measures the degree to which the summary information is covered by the document, and the deducibility level indicates the reasoning ability a model needs to generate the summary. These annotations can help researchers analyze and target their models' performance bottlenecks. We examine recent methods on CNewSum and release our dataset to provide a solid testbed for automatic Chinese summarization research.

