This paper addresses methodological issues in diachronic data analysis for historical research. We apply two families of topic models, LDA and DTM, to a relatively large collection of historical newspapers, with the aim of capturing and understanding discourse dynamics. Our case study focuses on newspapers and periodicals published in Finland between 1854 and 1917, but our method can easily be transposed to any diachronic data. Our main contributions are: a) a combined sampling, training and inference procedure for applying topic models to huge and imbalanced diachronic text collections; b) a discussion of the differences between the two topic models for this type of data; c) a quantification of topic prominence for a period, which generalizes document-wise topic assignment to the discourse level; and d) a discussion of the role of humanistic interpretation in analysing discourse dynamics through topic models.