Traditionally, Text Simplification is treated as a monolingual translation task, where sentences from source texts are aligned with their simplified counterparts for training. However, especially for longer input documents, summarizing the text (or dropping less relevant content altogether) plays an important role in the simplification process, which is currently not reflected in existing datasets. Simultaneously, resources for non-English languages are scarce in general and prohibitive for training new solutions. To tackle this problem, we pose core requirements for a system that can jointly summarize and simplify long source documents. We further describe the creation of a new dataset for joint Text Simplification and Summarization based on German Wikipedia and the German children's lexicon "Klexikon", consisting of almost 2,900 documents. We release a document-aligned version that particularly highlights the summarization aspect, and provide statistical evidence that this resource is well suited to simplification as well. Code and data are available on GitHub: https://github.com/dennlinger/klexikon