We present a cross-lingual summarisation corpus that pairs long documents in a source language with multi-sentence summaries in a target language. The corpus covers twelve language pairs and directions across four European languages (Czech, English, French and German), and the methodology used to create it can be applied to several other languages. We derive cross-lingual document-summary instances from Wikipedia by combining lead paragraphs and article bodies from language-aligned Wikipedia titles. We analyse the proposed cross-lingual summarisation task with automatic metrics and validate it with a human study. To illustrate the utility of our dataset, we report experiments with multi-lingual pre-trained models in supervised, zero- and few-shot, and out-of-domain scenarios.
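The pairing described in the abstract can be illustrated with a minimal sketch. It assumes that each instance takes the article body in the source language as the document and the lead paragraph of the aligned article in the target language as the summary; the abstract does not spell out these details. The toy data, the field names (`lead`, `body`) and the function `build_instances` are hypothetical and are not taken from the corpus or its tooling.

```python
from itertools import permutations

# Language codes for Czech, English, French and German.
LANGS = ["cs", "en", "fr", "de"]

# Hypothetical toy data: one group of language-aligned Wikipedia articles,
# each already split into a lead paragraph and a body (assumed preprocessing).
aligned_articles = [
    {
        "cs": {"lead": "Praha je hlavni mesto Ceska.", "body": "Dejiny mesta."},
        "en": {"lead": "Prague is the capital of Czechia.", "body": "History of the city."},
        "fr": {"lead": "Prague est la capitale de la Tchequie.", "body": "Histoire de la ville."},
        "de": {"lead": "Prag ist die Hauptstadt Tschechiens.", "body": "Geschichte der Stadt."},
    }
]


def build_instances(groups, langs=LANGS):
    """Pair each source-language body with each target-language lead."""
    instances = []
    for group in groups:
        # Ordered pairs of distinct languages: 4 * 3 = 12 directions.
        for src, tgt in permutations(langs, 2):
            if src in group and tgt in group:
                instances.append(
                    {
                        "src_lang": src,
                        "tgt_lang": tgt,
                        "document": group[src]["body"],  # long input document
                        "summary": group[tgt]["lead"],   # cross-lingual summary
                    }
                )
    return instances


instances = build_instances(aligned_articles)
directions = sorted({(i["src_lang"], i["tgt_lang"]) for i in instances})
print(len(directions), "directions from", len(LANGS), "languages")
```

Using ordered pairs rather than unordered ones is what gives twelve directions from four languages: each language pair contributes one instance per direction.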