Large, diachronic datasets of political discourse are hard to come by, especially for resource-lean languages such as Greek. In this paper, we introduce a curated dataset of the Greek Parliament Proceedings that spans the years 1989 to 2020. It consists of more than 1 million speeches with extensive metadata, extracted from 5,355 parliamentary record files. We explain how it was constructed and the challenges that we had to overcome. The dataset can be used for both computational linguistics and political analysis, ideally combining the two. We present such an application, showing how the dataset can be used to study changes in word usage (i) through time and (ii) between significant historical events and political parties, (iii) by evaluating and employing algorithms for detecting semantic shifts.