As the volume of text generated by humans and machines grows, understanding large corpora and extracting insights from them is becoming more important than ever. Dynamic topic models are effective methods for studying how the topics in a collection of documents evolve. They are widely used to understand trends, explore public opinion in social networks, and track research progress and discoveries in scientific archives. Since topics are defined as clusters of semantically similar documents, understanding how topics evolve as new knowledge emerges requires observing how the content or themes of these clusters change over time. In this paper, we introduce the Aligned Neural Topic Model (ANTM), a dynamic neural topic model that uses document embeddings to compute clusters of semantically similar documents in different periods and aligns these clusters to represent their evolution. This alignment procedure preserves the temporal similarity of document clusters and captures the semantic change of words, as characterized by their context within different periods. Experiments on four different datasets show that ANTM outperforms probabilistic dynamic topic models (e.g., DTM, DETM) and significantly improves topic coherence and diversity over other existing dynamic neural topic models (e.g., BERTopic).
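To make the idea of aligning per-period document clusters concrete, the following is a minimal sketch, not ANTM's actual procedure, which the abstract does not specify. It assumes that document embeddings have already been computed and clustered within each period. The function names (`centroid`, `cosine`, `align_clusters`), the greedy cosine-similarity matching, the 0.9 threshold, and the toy two-dimensional embeddings are all hypothetical choices made for illustration.

```python
import math


def centroid(vectors):
    """Mean vector of a non-empty list of equal-length embeddings."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def align_clusters(periods, threshold=0.9):
    """Greedily link clusters in consecutive periods by centroid similarity.

    periods: list of dicts {cluster_id: [embedding, ...]} in time order.
    Returns a list of chains; each chain is a list of (period_index, cluster_id)
    describing one evolving topic. The threshold is an illustrative assumption.
    """
    chains = []
    open_chains = {}  # previous-period cluster_id -> (chain, centroid)
    for t, clusters in enumerate(periods):
        next_open = {}
        used_prev = set()
        for cid, vectors in clusters.items():
            c = centroid(vectors)
            best, best_sim = None, threshold
            # Find the most similar unused cluster from the previous period.
            for pid, (_, prev_c) in open_chains.items():
                if pid in used_prev:
                    continue
                sim = cosine(c, prev_c)
                if sim >= best_sim:
                    best, best_sim = pid, sim
            if best is None:
                chain = []  # no match: a new topic starts in this period
                chains.append(chain)
            else:
                chain = open_chains[best][0]  # match: extend the existing topic
                used_prev.add(best)
            chain.append((t, cid))
            next_open[cid] = (chain, c)
        open_chains = next_open
    return chains


# Toy example: two periods, each with two pre-computed clusters of 2-D embeddings.
periods = [
    {"a": [[1.0, 0.0], [0.9, 0.1]], "b": [[0.0, 1.0], [0.1, 0.9]]},
    {"x": [[0.95, 0.05], [1.0, 0.1]], "y": [[-1.0, 0.0], [-0.9, -0.1]]},
]
chains = align_clusters(periods)
print(chains)
```

In this toy input, cluster `x` in the second period lies close to cluster `a` in the first, so the two are linked into one evolving topic, while `b` and `y` have no sufficiently similar counterpart and each remain a single-period topic.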