Cross-Lingual Summarization (CLS) aims to generate a summary in one language for a document written in another language. CLS has attracted wide research attention due to its practical significance in a multilingual world. Despite substantial progress, existing CLS work typically focuses on short documents, such as news articles, short dialogues, and guides. In contrast to these short texts, long documents such as academic articles and business reports usually discuss complicated subjects and run to thousands of words, making them non-trivial to process and summarize. To promote CLS research on long documents, we construct Perseus, the first long-document CLS dataset, which collects about 94K Chinese scientific documents paired with English summaries. The average document length in Perseus is more than two thousand tokens. As a preliminary study on long-document CLS, we build and evaluate various CLS baselines, including pipeline and end-to-end methods. Experimental results on Perseus show the superiority of the end-to-end baseline, which outperforms strong pipeline models equipped with sophisticated machine translation systems. Furthermore, to provide a deeper understanding, we manually analyze the model outputs and discuss specific challenges faced by current approaches. We hope that our work can serve as a benchmark for long-document CLS and benefit future studies.