Text summarization has been studied intensively in many languages, and for some of them it has reached an advanced stage. Arabic Text Summarization (ATS), however, is still at an early stage of development. Existing ATS datasets are either small or lack diversity. We build LANS, a large-scale and diverse dataset for the Arabic Text Summarization task. LANS offers 8.4 million articles and their summaries, extracted from the metadata of newspaper websites published between 1999 and 2019. The high-quality and diverse summaries are written by journalists from 22 major Arab newspapers and cover an eclectic mix of at least 7 topics from each source. We conduct an intrinsic evaluation of LANS using both automatic and human evaluation. Human evaluation of 1000 random samples reports 95.4% accuracy for the collected summaries, and automatic evaluation quantifies the diversity and abstractness of the summaries. The dataset is publicly available upon request.