Topic models are among the most widely used methods in natural language processing, allowing researchers to estimate the underlying themes in a collection of documents. Most topic models use unsupervised methods and hence require the additional step of attaching meaningful labels to estimated topics. This process of manual labeling is not scalable and often problematic because it depends on the domain expertise of the researcher and may be affected by cardinality in human decision making. As a consequence, insights drawn from a topic model are difficult to replicate. We present a semi-automatic transfer topic labeling method that seeks to remedy some of these problems. We take advantage of the fact that many areas of research already have domain-specific codebooks that can be exploited for automated topic labeling. We demonstrate our approach with a dynamic topic model analysis of the complete corpus of UK House of Commons speeches from 1935 to 2014, using the coding instructions of the Comparative Agendas Project to label topics. We show that our method works well for a majority of the topics we estimate, but we also find institution-specific topics, in particular on subnational governance, that require manual input. The method proposed in the paper can be easily extended to other areas with existing domain-specific knowledge bases, such as party manifestos, open-ended survey questions, social media data, and legal documents, in ways that can add knowledge to research programs.
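The general idea of labeling a topic from a codebook can be illustrated with a minimal sketch. This is not the procedure used in the paper, which the abstract does not specify; it assumes a simple bag-of-words cosine similarity between a topic's top words and each codebook category description. The codebook entries, the function names, and the `threshold` value are all hypothetical and chosen only for illustration.

```python
import math
from collections import Counter

# Hypothetical codebook: category label -> short description text.
# These entries are invented for illustration, not taken from any real codebook.
CODEBOOK = {
    "Health": "health care hospitals doctors nurses patients medical insurance",
    "Defence": "defence armed forces army navy military weapons war",
    "Education": "education schools teachers pupils universities students",
}


def bag(words):
    """Build a lowercase bag-of-words count vector."""
    return Counter(w.lower() for w in words)


def cosine(a, b):
    """Cosine similarity between two count vectors (0.0 if either is empty)."""
    dot = sum(count * b[word] for word, count in a.items() if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def label_topic(top_words, codebook=CODEBOOK, threshold=0.2):
    """Return (label, score) for the best-matching codebook category.

    The label is None when the best score falls below the (arbitrary,
    assumed) threshold, flagging the topic for manual labeling.
    """
    topic_vec = bag(top_words)
    scores = {
        label: cosine(topic_vec, bag(text.split()))
        for label, text in codebook.items()
    }
    best = max(scores, key=scores.get)
    if scores[best] < threshold:
        return None, scores[best]
    return best, scores[best]


if __name__ == "__main__":
    # A topic whose top words overlap with a codebook category.
    print(label_topic(["hospitals", "patients", "nurses", "doctors", "health"]))
    # A topic with no codebook match, left for manual input.
    print(label_topic(["scotland", "wales", "devolution", "assembly"]))
```

In this sketch, a topic with no close codebook match returns no label, which mirrors the semi-automatic design described above: topics the codebook does not cover, such as institution-specific ones, are passed to a human coder.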