Open4Business (O4B): An Open Access Dataset for Summarizing Business Documents

A major challenge in fine-tuning deep learning models for automatic summarization is the need for large domain-specific datasets. One of the barriers to curating such data from resources like online publications is navigating the license regulations applicable to their re-use, especially for commercial purposes. As a result, despite the availability of several business journals, there are no large-scale datasets for summarizing business documents. In this work, we introduce Open4Business (O4B), a dataset of 17,458 open access business articles and their reference summaries. The dataset introduces a new challenge for summarization in the business domain, requiring summaries that are highly abstractive and more concise than those in other existing datasets. Additionally, we evaluate existing models on it and show that models trained on O4B achieve summarization performance comparable to models trained on a 7x larger non-open access dataset. We release the dataset, along with the code, which can be leveraged to gather similar data for multiple domains.
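The abstract describes O4B summaries as highly abstractive and concise but does not say how these properties are measured. As a rough illustration only, the sketch below computes two commonly used proxies: the fraction of summary n-grams that do not appear in the source document, and the document-to-summary length ratio. The function names, whitespace tokenization, and choice of bigrams are assumptions made for this example, not the authors' method.

```python
from typing import List, Set, Tuple


def ngrams(tokens: List[str], n: int) -> Set[Tuple[str, ...]]:
    """Return the set of n-grams in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def novel_ngram_ratio(document: str, summary: str, n: int = 2) -> float:
    """Fraction of distinct summary n-grams absent from the document.

    Higher values suggest a more abstractive summary. Tokenization is
    plain lowercase whitespace splitting (an assumption for this sketch).
    """
    summary_ngrams = ngrams(summary.lower().split(), n)
    if not summary_ngrams:
        return 0.0
    document_ngrams = ngrams(document.lower().split(), n)
    return len(summary_ngrams - document_ngrams) / len(summary_ngrams)


def compression_ratio(document: str, summary: str) -> float:
    """Document length divided by summary length, in whitespace tokens.

    Higher values indicate a more concise summary.
    """
    summary_length = len(summary.split())
    if summary_length == 0:
        raise ValueError("summary must contain at least one token")
    return len(document.split()) / summary_length


if __name__ == "__main__":
    # Toy example, not taken from the dataset.
    doc = "the company reported strong revenue growth in the third quarter"
    ref = "the company reported record profits"
    print(novel_ngram_ratio(doc, ref, n=2))
    print(compression_ratio(doc, ref))
```

Averaging these two numbers over all article and summary pairs gives one simple way to compare datasets on abstractiveness and conciseness.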


