A major challenge in fine-tuning deep learning models for automatic summarization is the need for large domain-specific datasets. One of the barriers to curating such data from resources like online publications is navigating the license regulations applicable to their re-use, especially for commercial purposes. As a result, despite the availability of several business journals, there are no large-scale datasets for summarizing business documents. In this work, we introduce Open4Business (O4B), a dataset of 17,458 open access business articles and their reference summaries. The dataset introduces a new challenge for summarization in the business domain, requiring summaries that are highly abstractive and more concise than those in existing datasets. Additionally, we evaluate existing models on it and show that models trained on O4B achieve summarization performance comparable to that of models trained on a 7x larger non-open access dataset. We release the dataset, along with code that can be leveraged to similarly gather data for multiple domains.