Klexikon: A German Dataset for Joint Summarization and Simplification

Traditionally, Text Simplification is treated as a monolingual translation task in which sentences from source texts are aligned with their simplified counterparts for training. However, especially for longer input documents, summarizing the text (or dropping less relevant content altogether) plays an important role in the simplification process, and this is currently not reflected in existing datasets. At the same time, resources for non-English languages are generally scarce, which makes training new solutions prohibitive. To tackle this problem, we pose core requirements for a system that can jointly summarize and simplify long source documents. We further describe the creation of a new dataset for joint Text Simplification and Summarization based on German Wikipedia and the German children's lexicon "Klexikon", consisting of almost 2900 documents. We release a document-aligned version that particularly highlights the summarization aspect, and provide statistical evidence that this resource is well suited to simplification as well. Code and data are available on GitHub: https://github.com/dennlinger/klexikon
