Topic models aim to reveal the latent structure behind a corpus and typically operate on a bag-of-words representation of documents. In topic modeling, most of the vocabulary is either irrelevant for uncovering the underlying topics or strongly related to relevant concepts, which harms the interpretability of the resulting topics. Furthermore, the limited expressiveness and language dependency of these models demand considerable computational resources. Hence, we propose a novel approach to cluster-based topic modeling that employs conceptual entities. Entities are language-agnostic representations of real-world concepts that are rich in relational information. To this end, we extract vector representations of entities from (i) an encyclopedic corpus using a language model; and (ii) a knowledge base using a graph neural network. We demonstrate that our approach consistently outperforms other state-of-the-art topic models across coherence metrics, and we find that the explicit knowledge encoded in the graph-based embeddings yields more coherent topics than the implicit knowledge encoded in the contextualized embeddings of language models.