The lack of creative ability in abstractive methods is a particular problem in automatic text summarization: the summaries generated by models are mostly extracted verbatim from the source articles. One of the main causes of this problem is the shortage of datasets with a high level of abstractiveness, especially for Chinese. To address this, we paraphrase the reference summaries in CLTS, the Chinese Long Text Summarization dataset, correct errors of factual inconsistency, and propose CLTS+, the first Chinese long text summarization dataset with a high level of abstractiveness, which contains more than 180K article-summary pairs and is available online. Additionally, we introduce an intrinsic metric based on co-occurrence words to evaluate the dataset we constructed. We analyze the extraction strategies used in CLTS+ summaries against those of other datasets to quantify the abstractiveness and difficulty of our new data, and we train several baselines on CLTS+ to verify its utility for improving the creative ability of models.
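The abstract describes an intrinsic abstractiveness metric based on co-occurrence words but does not specify its formula. As a rough illustration only, a minimal sketch of one common proxy for abstractiveness is the fraction of summary tokens that do not co-occur in the source article; the function name and tokenization below are assumptions for illustration, not the paper's actual metric.

```python
def novel_token_ratio(source: str, summary: str) -> float:
    """Fraction of summary tokens absent from the source text.

    A hypothetical proxy for abstractiveness: a purely extractive
    summary scores 0.0, while a summary sharing no tokens with the
    source scores 1.0. Whitespace tokenization is assumed here;
    Chinese text would require word segmentation instead.
    """
    source_tokens = set(source.split())
    summary_tokens = summary.split()
    if not summary_tokens:
        return 0.0
    novel = sum(1 for tok in summary_tokens if tok not in source_tokens)
    return novel / len(summary_tokens)
```

Under this kind of proxy, a dataset whose reference summaries score higher on average would be considered more abstractive, which is the property the paraphrased CLTS+ summaries aim to improve over the original CLTS.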