Aspect- or query-based summarization has recently attracted growing attention, as it can generate summaries tailored to users' interests. However, existing datasets for aspect- or query-based summarization either focus on specific domains, are relatively small in scale, or include only a few aspect types. Such limitations hinder further exploration in this direction. In this work, we take advantage of crowd-sourced knowledge on Wikipedia.org and automatically create a high-quality, large-scale, open-domain aspect-based summarization dataset named OASum, which contains more than 3.7 million instances with around 1 million different aspects on 2 million Wikipedia pages. We provide benchmark results on OASum and demonstrate its ability to support diverse aspect-based summary generation. To overcome the data scarcity problem in specific domains, we also perform zero-shot, few-shot, and fine-tuning experiments on seven downstream datasets. Specifically, the zero-shot, few-shot, and fine-tuning results show that the model pre-trained on our corpus demonstrates a strong aspect- or query-focused generation ability compared with the backbone model. Our dataset and pre-trained checkpoints are publicly available.