CLTS+: A New Chinese Long Text Summarization Dataset with Abstractive Summaries

The lack of creative ability in abstractive methods is a particular problem in automatic text summarization: the summaries generated by models are mostly extracted from the source articles. One of the main causes is the lack of datasets with abstractive summaries, especially for Chinese. To address this, we paraphrase the reference summaries in CLTS, the Chinese Long Text Summarization dataset, correct factual inconsistencies, and propose CLTS+, the first Chinese long text summarization dataset with a high level of abstractiveness, which contains more than 180K article-summary pairs and is available online. Additionally, we introduce an intrinsic metric based on co-occurrence words to evaluate the dataset we constructed. We compare the extraction strategies used in CLTS+ summaries against those of other datasets to quantify the abstractiveness and difficulty of our new data, and train several baselines on CLTS+ to verify its utility for improving the creative ability of models.
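The abstract does not spell out the co-occurrence metric, so the following is only a minimal sketch of one common proxy for abstractiveness: the fraction of summary n-grams that never appear in the source article. The function names `char_ngrams` and `novel_ngram_ratio` are hypothetical, and the use of character n-grams (to avoid assuming a Chinese word segmenter) is an assumption, not the authors' method.

```python
from typing import List, Set


def char_ngrams(text: str, n: int) -> List[str]:
    """Return all character n-grams of text, ignoring whitespace."""
    chars = [c for c in text if not c.isspace()]
    return ["".join(chars[i:i + n]) for i in range(len(chars) - n + 1)]


def novel_ngram_ratio(article: str, summary: str, n: int = 2) -> float:
    """Fraction of summary n-grams that never occur in the article.

    A value near 0 means the summary is mostly copied from the article;
    a value near 1 means it is mostly new wording. This is an illustrative
    proxy, not the metric defined in the paper.
    """
    summary_grams = char_ngrams(summary, n)
    if not summary_grams:
        return 0.0  # too short to form any n-gram
    article_grams: Set[str] = set(char_ngrams(article, n))
    novel = sum(1 for g in summary_grams if g not in article_grams)
    return novel / len(summary_grams)
```

Averaging this ratio over all article-summary pairs would give one rough way to compare how extractive two datasets are; a word-level variant would need a segmenter, which is left out here.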

