In recent years, fully automated content analysis based on probabilistic topic models has become popular among social scientists because of its scalability. The unsupervised nature of these models makes them suitable for exploring topics in a corpus without prior knowledge. However, researchers find that the models often fail to measure specific concepts of substantive interest because they inadvertently create multiple topics with similar content and combine distinct themes into a single topic. In this paper, we empirically demonstrate that providing a small number of keywords can substantially enhance the measurement performance of topic models. An important advantage of the proposed keyword-assisted topic model (keyATM) is that the specification of keywords requires researchers to label topics prior to fitting a model to the data. This contrasts with the widespread practice of post-hoc topic interpretation and adjustment, which compromises the objectivity of empirical findings. In our application, we find that keyATM provides more interpretable results, has better document classification performance, and is less sensitive to the number of topics than standard topic models. Finally, we show that keyATM can also incorporate covariates and model time trends. An open-source software package is available for implementing the proposed methodology.