We present CrossSum, a large-scale cross-lingual abstractive summarization dataset comprising 1.7 million article-summary samples in 1500+ language pairs. We create CrossSum by aligning identical articles written in different languages via cross-lingual retrieval from a multilingual summarization dataset. We propose a multi-stage data sampling algorithm to effectively train a cross-lingual summarization model capable of summarizing an article in any target language. We also propose LaSE, a new metric for automatically evaluating model-generated summaries that shows a strong correlation with ROUGE. Performance on both ROUGE and LaSE indicates that pretrained models fine-tuned on CrossSum consistently outperform baseline models, even when the source and target language pairs are linguistically distant. To the best of our knowledge, CrossSum is the largest cross-lingual summarization dataset and the first ever that does not rely solely on English as the pivot language. We are releasing the dataset, alignment and training scripts, and the models to spur future research on cross-lingual abstractive summarization. The resources can be found at https://github.com/csebuetnlp/CrossSum.
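As a concrete illustration of the retrieval-based alignment step described above, the sketch below pairs articles across languages by nearest-neighbor search over multilingual sentence embeddings. This is a minimal sketch only: the LaBSE model name follows the sentence-transformers hub, while the 0.8 similarity threshold, the greedy one-best matching, and the `align_articles` helper are illustrative assumptions, not the paper's exact procedure.

```python
# Minimal sketch of cross-lingual article alignment via embedding retrieval.
# The retriever (LaBSE) is a common choice for this task; the threshold and
# greedy matching below are illustrative assumptions.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/LaBSE")

def align_articles(src_texts, tgt_texts, threshold=0.8):
    """Return (src_idx, tgt_idx, score) triples whose cosine similarity
    clears `threshold`. Embeddings are L2-normalized, so the dot product
    equals cosine similarity."""
    src_emb = model.encode(src_texts, normalize_embeddings=True)
    tgt_emb = model.encode(tgt_texts, normalize_embeddings=True)
    sim = src_emb @ tgt_emb.T  # (len(src), len(tgt)) similarity matrix
    pairs = []
    for i, row in enumerate(sim):
        j = int(np.argmax(row))     # best target candidate for source i
        if row[j] >= threshold:     # keep only confident matches
            pairs.append((i, j, float(row[j])))
    return pairs

# Usage: align an English article against Spanish candidates.
english = ["The central bank raised interest rates on Tuesday."]
spanish = ["El banco central subió los tipos de interés el martes.",
           "El equipo local ganó el partido por dos goles."]
print(align_articles(english, spanish))
```

In practice, mutual (bidirectional) nearest-neighbor checks are often added on top of such one-directional retrieval to reduce false alignments.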