We present ClidSum, a benchmark dataset for building cross-lingual summarization systems on dialogue documents. It consists of 67k+ dialogue documents from two subsets (i.e., SAMSum and MediaSum) and 112k+ annotated summaries in different target languages. Based on the proposed ClidSum, we introduce two benchmark settings for supervised and semi-supervised scenarios, respectively. We then build various baseline systems in different paradigms (pipeline and end-to-end) and conduct extensive experiments on ClidSum to provide deeper analyses. Furthermore, we propose mDialBART, which extends mBART-50 (a multi-lingual BART) via further pre-training. The multiple objectives used in the further pre-training stage help the pre-trained model capture the structural characteristics and salient content of dialogues, as well as the transformation from the source language to the target language. Experimental results show the superiority of mDialBART: as an end-to-end model, it outperforms strong pipeline models on ClidSum. Finally, we discuss the specific challenges that current approaches face on this task and suggest multiple promising directions for future research. We have released the dataset and code at https://github.com/krystalan/ClidSum.
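To make the end-to-end paradigm concrete, below is a minimal sketch of how a cross-lingual dialogue summarizer built on mBART-50 would be invoked. This is not the authors' released code: the public checkpoint "facebook/mbart-large-50-many-to-many-mmt" stands in for mDialBART (which would be loaded from the authors' released weights), and the example dialogue and generation settings are illustrative assumptions.

```python
# Hedged sketch: end-to-end cross-lingual summarization with an mBART-50-style
# model. A single encoder-decoder reads an English dialogue and is forced to
# decode in the target language, performing summarization and translation jointly.
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

# Assumption: a public mBART-50 checkpoint as a stand-in for mDialBART.
model_name = "facebook/mbart-large-50-many-to-many-mmt"
tokenizer = MBart50TokenizerFast.from_pretrained(model_name, src_lang="en_XX")
model = MBartForConditionalGeneration.from_pretrained(model_name)

# Illustrative SAMSum-style dialogue (not from the dataset).
dialogue = "Amanda: I baked cookies. Do you want some? Jerry: Sure! Amanda: I'll bring you some tomorrow."
inputs = tokenizer(dialogue, return_tensors="pt", truncation=True)

# Force the first decoded token to be the target-language code (here German,
# "de_DE"), so the output summary is produced directly in the target language.
summary_ids = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.lang_code_to_id["de_DE"],
    max_length=64,
    num_beams=4,
)
print(tokenizer.batch_decode(summary_ids, skip_special_tokens=True)[0])
```

By contrast, a pipeline baseline would chain two separate models (a monolingual summarizer followed by a machine-translation system), which is what the end-to-end mDialBART is compared against in the paper.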