OASum: Large-Scale Open Domain Aspect-based Summarization

Aspect or query-based summarization has recently attracted more attention because it can generate differentiated summaries based on users' interests. However, existing datasets for aspect or query-based summarization either focus on specific domains, are relatively small in scale, or cover only a few aspect types. Such limitations hinder further exploration in this direction. In this work, we take advantage of crowd-sourced knowledge on Wikipedia.org and automatically create a high-quality, large-scale open-domain aspect-based summarization dataset named OASum, which contains more than 3.7 million instances with around 1 million different aspects on 2 million Wikipedia pages. We provide benchmark results on OASum and demonstrate its ability to support the generation of diverse aspect-based summaries. To overcome the data scarcity problem in specific domains, we also run zero-shot, few-shot, and fine-tuning experiments on seven downstream datasets. Specifically, the zero/few-shot and fine-tuning results show that the model pre-trained on our corpus demonstrates a strong aspect or query-focused generation ability compared with the backbone model. Our dataset and pre-trained checkpoints are publicly available.


