HTMOT: Hierarchical Topic Modelling Over Time

Over the years, topic models have provided an efficient way of extracting insights from text. However, while many models have been proposed, none is able to model topic temporality and hierarchy jointly. Modelling time provides more precise topics by separating lexically close but temporally distinct topics, while modelling hierarchy provides a more detailed view of the content of a document corpus. In this study, we therefore propose a novel method, HTMOT, to perform Hierarchical Topic Modelling Over Time. We train HTMOT using a new, more efficient implementation of Gibbs sampling. Specifically, we show that applying time modelling only to deep sub-topics provides a way to extract specific stories or events, while high-level topics extract larger themes in the corpus. Our results show that our training procedure is fast and can extract accurate high-level topics and temporally precise sub-topics. We measured our model's performance using the Word Intrusion task and outlined some limitations of this evaluation method, especially for hierarchical models. As a case study, we focused on the various developments in the space industry in 2020.
