Parliamentary debates are a large and still partly untapped trove of publicly accessible texts. For the German-speaking area, there is a lack of uniformly accessible and annotated corpora covering all German-speaking parliaments at the national and federal level. To address this gap, we introduce the German Parliament Corpus (GerParCor). GerParCor is a genre-specific corpus of (predominantly historical) German-language parliamentary protocols from three centuries and four countries, including state-level and federal-level data. In addition, GerParCor contains conversions of scanned protocols, in particular of protocols printed in Fraktur, which were converted with an OCR process based on Tesseract. All protocols were preprocessed with the NLP pipeline of spaCy 3 and automatically annotated with metadata on their session date. GerParCor is made available in the XMI format of the UIMA project. In this way, GerParCor can serve as a large corpus of historical texts in the field of political communication for a variety of NLP tasks.
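The abstract does not say how session dates are detected, so the following is only a minimal sketch of one plausible approach, assuming that a protocol states its session date in spelled-out German form (for example "12. März 1921"). The names `GERMAN_MONTHS`, `DATE_PATTERN`, and `extract_session_date` are hypothetical and are not part of GerParCor.

```python
import re
from datetime import date
from typing import Optional

# Assumption: dates appear spelled out in German, e.g. "12. März 1921".
GERMAN_MONTHS = {
    "januar": 1, "februar": 2, "märz": 3, "april": 4, "mai": 5, "juni": 6,
    "juli": 7, "august": 8, "september": 9, "oktober": 10,
    "november": 11, "dezember": 12,
}

# Day number, a dot, a month word, and a four-digit year.
DATE_PATTERN = re.compile(r"\b(\d{1,2})\.\s*([A-Za-zÄÖÜäöü]+)\s+(\d{4})\b")


def extract_session_date(text: str) -> Optional[date]:
    """Return the first valid spelled-out German date in text, or None."""
    for day, month_name, year in DATE_PATTERN.findall(text):
        month = GERMAN_MONTHS.get(month_name.lower())
        if month is None:
            continue  # the matched word is not a German month name
        try:
            return date(int(year), month, int(day))
        except ValueError:
            continue  # impossible calendar date, e.g. "31. Februar"
    return None


if __name__ == "__main__":
    header = "Sitzung am Dienstag, den 12. März 1921, in Berlin"
    print(extract_session_date(header))
```

OCR output from Fraktur scans is likely to contain misrecognized month names or digits, so a real pipeline would presumably need fuzzy matching or fallback sources such as file names; this sketch simply returns `None` in such cases.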