Topic modeling has emerged as a dominant method for exploring large document collections. Recent approaches to topic modeling use large contextualized language models and variational autoencoders. In this paper, we propose a negative sampling mechanism for a contextualized topic model to improve the quality of the generated topics. In particular, during model training, we perturb the generated document-topic vector and use a triplet loss to encourage the document reconstructed from the correct document-topic vector to be similar to the input document and dissimilar to the document reconstructed from the perturbed vector. Experiments across different topic counts on three publicly available benchmark datasets show that, in most cases, our approach yields higher topic coherence than the baselines. Our model also achieves very high topic diversity.
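The triplet objective described above can be illustrated with a minimal NumPy sketch. The specific choices here are assumptions for illustration, not the paper's exact design: the perturbation is a random permutation of the document-topic vector, the decoder is a simple topic-word matrix, and distances are Euclidean.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.5):
    # Triplet margin loss: push the anchor closer to the positive
    # reconstruction than to the negative one by at least `margin`.
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, d_pos - d_neg + margin)

rng = np.random.default_rng(0)
vocab_size, n_topics = 50, 5

# Hypothetical decoder: a topic-word matrix whose rows are word
# distributions; it maps a document-topic vector to a reconstruction.
beta = rng.dirichlet(np.ones(vocab_size), size=n_topics)

theta = rng.dirichlet(np.ones(n_topics))   # document-topic vector
theta_neg = rng.permutation(theta)         # perturbed vector (illustrative choice)

x = rng.dirichlet(np.ones(vocab_size))     # input document (anchor)
x_pos = beta.T @ theta                     # reconstruction from the correct theta
x_neg = beta.T @ theta_neg                 # reconstruction from the perturbed theta

loss = triplet_loss(x, x_pos, x_neg)
```

Minimizing this loss during training encourages the reconstruction from the correct document-topic vector to stay close to the input while the reconstruction from the perturbed vector is pushed away.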