Existing summarization datasets come with two main drawbacks: (1) they tend to focus on overly exposed domains, such as news articles or wiki-like texts, and (2) they are primarily monolingual, with few multilingual datasets available. In this work, we propose a novel dataset, called EUR-Lex-Sum, based on manually curated document summaries of legal acts from the European Union law platform (EUR-Lex). Documents and their respective summaries exist as cross-lingual paragraph-aligned data in several of the 24 official European languages, enabling access to various cross-lingual and lower-resourced summarization setups. We obtain up to 1,500 document/summary pairs per language, including a subset of 375 cross-lingually aligned legal acts with texts available in all 24 languages. We detail the data acquisition process and compare key characteristics of the resource to existing summarization resources. In particular, we illustrate challenging sub-problems and open questions on the dataset that could facilitate future research on domain-specific cross-lingual summarization. Given the constraints imposed by the extreme length and language diversity of the samples, we further conduct experiments with suitable extractive monolingual and cross-lingual baselines as a reference for future work. Code for the extraction as well as access to our data and baselines is available online at: https://github.com/achouhan93/eur-lex-sum.