HunSum-1: an Abstractive Summarization Dataset for Hungarian

We introduce HunSum-1: a dataset for Hungarian abstractive summarization, consisting of 1.14M news articles. The dataset is built by collecting, cleaning and deduplicating data from 9 major Hungarian news sites through CommonCrawl. Using this dataset, we build abstractive summarizer models based on huBERT and mT5. We demonstrate the value of the created dataset by performing a quantitative and qualitative analysis on the models' results. The HunSum-1 dataset, all models used in our experiments and our code are available open source.
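The abstract names deduplication as one step of building the dataset but does not say how it is done. As a rough illustration only, the sketch below removes exact duplicates by hashing a whitespace- and case-normalized article body. The `url` and `body` field names, the normalization rule, and the use of SHA-256 are all assumptions made for this example, not the authors' pipeline.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Collapse whitespace and lowercase so trivially different copies match."""
    return re.sub(r"\s+", " ", text).strip().lower()


def deduplicate(articles: list[dict]) -> list[dict]:
    """Keep the first article seen for each distinct normalized body."""
    seen: set[str] = set()
    unique = []
    for article in articles:
        # Hypothetical schema: each article is a dict with a "body" string.
        digest = hashlib.sha256(normalize(article["body"]).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(article)
    return unique


# Toy records; the URLs and texts are made up for illustration.
articles = [
    {"url": "https://example.com/a", "body": "Budapest  ma reggel..."},
    {"url": "https://example.com/b", "body": "budapest ma reggel..."},
    {"url": "https://example.com/c", "body": "Egy másik cikk."},
]
print(len(deduplicate(articles)))
```

Exact hashing only catches copies that are identical after normalization; near-duplicates (for example, the same article with a changed headline) would need a fuzzier method such as shingling or MinHash.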


