Extracting and identifying latent topics in large text corpora has gained increasing importance in Natural Language Processing (NLP). Most models, whether probabilistic models such as Latent Dirichlet Allocation (LDA) or neural topic models, follow the same underlying approach to topic extraction and interpretability. We propose a method that incorporates a deeper understanding of both sentence and document themes and goes beyond simply analyzing word frequencies in the data. This allows our model to detect latent topics that may include uncommon words or neologisms, as well as words not present in the documents themselves. Additionally, we propose several new evaluation metrics based on intruder words and similarity measures in the semantic space. We report correlation coefficients with human identification of intruder words and achieve near-human-level results on the word-intrusion task. We demonstrate the competitive performance of our method in a large benchmark study, achieving superior results compared to state-of-the-art topic modeling and document clustering models.
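The intruder-word idea mentioned above can be sketched in embedding space: given a topic's top words plus one out-of-topic word, the intruder should be the word least similar to the rest. The snippet below is an illustrative sketch only, using toy vectors and mean cosine similarity; it is not the paper's actual metric, and the helper name and embeddings are assumptions.

```python
import numpy as np

def detect_intruder(words, vectors):
    """Return the word whose embedding has the lowest mean cosine
    similarity to the other words (the presumed intruder)."""
    V = np.array([vectors[w] for w in words], dtype=float)
    V = V / np.linalg.norm(V, axis=1, keepdims=True)  # unit-normalize rows
    sims = V @ V.T                                    # pairwise cosine similarities
    np.fill_diagonal(sims, 0.0)                       # ignore self-similarity
    mean_sim = sims.sum(axis=1) / (len(words) - 1)    # mean similarity to the rest
    return words[int(np.argmin(mean_sim))]

# Toy embeddings: three "sports" words cluster together; "tax" is the intruder.
toy_vectors = {
    "goal":   [0.90, 0.10, 0.00],
    "league": [0.80, 0.20, 0.10],
    "coach":  [0.85, 0.15, 0.05],
    "tax":    [0.00, 0.10, 0.95],
}
print(detect_intruder(["goal", "league", "coach", "tax"], toy_vectors))  # -> tax
```

With real pre-trained embeddings in place of the toy vectors, agreement between this automatic pick and human picks can be scored with a correlation coefficient, which is the style of evaluation the abstract describes.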