EUR-Lex-Sum: A Multi- and Cross-lingual Dataset for Long-form Summarization in the Legal Domain

Existing summarization datasets come with two main drawbacks: (1) they tend to focus on overly exposed domains, such as news articles or wiki-like texts, and (2) they are primarily monolingual, with few multilingual datasets available. In this work, we propose a novel dataset, called EUR-Lex-Sum, based on manually curated document summaries of legal acts from the European Union law platform (EUR-Lex). Documents and their respective summaries exist as cross-lingual, paragraph-aligned data in several of the 24 official European languages, enabling access to various cross-lingual and lower-resourced summarization setups. We obtain up to 1,500 document/summary pairs per language, including a subset of 375 cross-lingually aligned legal acts with texts available in all 24 languages. We detail the data acquisition process and compare key characteristics of the resource to existing summarization resources. In particular, we illustrate challenging sub-problems and open questions on the dataset that could facilitate future research on domain-specific cross-lingual summarization. Limited by the extreme length and language diversity of the samples, we further conduct experiments with suitable extractive monolingual and cross-lingual baselines for future work. Code for the extraction, as well as access to our data and baselines, is available online at: https://github.com/achouhan93/eur-lex-sum.
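
The fully aligned subset mentioned above (legal acts with texts in all 24 languages) can be thought of as the intersection of the document identifiers available per language. The following is a minimal sketch of that idea only, not the authors' extraction code: the function name `fully_aligned_ids`, the language codes, and the toy identifiers are all hypothetical, and the real dataset's layout may differ.

```python
from typing import Dict, Set


def fully_aligned_ids(ids_by_language: Dict[str, Set[str]]) -> Set[str]:
    """Return the document IDs that are present for every language.

    `ids_by_language` maps a language code to the set of document IDs
    that have a document/summary pair in that language (assumed layout).
    """
    if not ids_by_language:
        return set()
    # A document belongs to the aligned subset only if every language has it.
    return set.intersection(*ids_by_language.values())


# Toy data with made-up identifiers, purely for illustration.
toy = {
    "en": {"doc-1", "doc-2", "doc-3"},
    "de": {"doc-1", "doc-3"},
    "mt": {"doc-3", "doc-4"},
}

aligned = fully_aligned_ids(toy)
print(sorted(aligned))
```

In this toy case only `doc-3` is available in all three languages, so it is the only member of the aligned subset.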
