We introduce HunSum-1, a dataset for Hungarian abstractive summarization consisting of 1.14M news articles. The dataset is built by collecting, cleaning, and deduplicating data from nine major Hungarian news sites through CommonCrawl. Using this dataset, we build abstractive summarization models based on huBERT and mT5. We demonstrate the value of the dataset through a quantitative and qualitative analysis of the models' outputs. The HunSum-1 dataset, all models used in our experiments, and our code are available as open source.
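
The abstract names a deduplication step without describing it. As a minimal sketch of one common approach, the following removes exact duplicates by hashing normalized article text. The normalization rule, the function names, and the `article` field name are assumptions made for illustration; they are not taken from the HunSum-1 code.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies match."""
    return re.sub(r"\s+", " ", text).strip().lower()


def deduplicate(articles: list[dict]) -> list[dict]:
    """Keep the first article seen for each distinct normalized body text.

    Each item is assumed to be a dict with an "article" key holding the
    body text (a hypothetical field name used only in this sketch).
    """
    seen: set[str] = set()
    unique = []
    for item in articles:
        digest = hashlib.sha256(normalize(item["article"]).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(item)
    return unique
```

Exact hashing only catches copies that are identical after normalization. Near-duplicates, such as a republished article with an updated paragraph, would need a fuzzy method such as MinHash, which this sketch does not attempt.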