It has been reported that clustering-based topic models, which cluster high-quality sentence embeddings and apply an appropriate word selection method, can generate better topics than generative probabilistic topic models. However, these approaches suffer from two shortcomings: difficulty in selecting appropriate parameters, and incomplete models that overlook the quantitative relations between words and topics and between topics and documents. To address these issues, we propose graph to topic (G2T), a simple but effective framework for topic modelling. The framework comprises four modules. First, document representations are acquired with pretrained language models. Second, a semantic graph is constructed according to the similarity between document representations. Third, communities in the document semantic graph are identified, and the relationship between topics and documents is quantified accordingly. Fourth, the word-topic distribution is computed based on a variant of TF-IDF. Automatic evaluation suggests that G2T achieves state-of-the-art performance on both English and Chinese documents of varying lengths. Human judgements show that G2T produces topics with better interpretability and coverage than baselines. In addition, G2T not only determines the number of topics automatically but also gives the probabilistic distributions of words within topics and of topics within documents. Finally, G2T is publicly available, and the distillation experiments provide insight into how it works.
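The four-module pipeline above can be sketched in miniature. The snippet below is a hedged illustration, not the authors' implementation: it assumes documents are already embedded as vectors (in practice a pretrained language model would supply these), thresholds cosine similarity to build the semantic graph, uses connected components as a simple stand-in for community detection, and scores topic words with one possible class-based TF-IDF variant; the threshold value and the exact TF-IDF formula used by G2T are assumptions here.

```python
import math
from collections import Counter, defaultdict


def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)


def build_graph(embeddings, threshold=0.8):
    """Module 2 (sketch): connect documents whose similarity exceeds a threshold."""
    n = len(embeddings)
    adj = defaultdict(set)
    for i in range(n):
        for j in range(i + 1, n):
            if cosine(embeddings[i], embeddings[j]) >= threshold:
                adj[i].add(j)
                adj[j].add(i)
    return adj


def communities(adj, n):
    """Module 3 (sketch): connected components as a stand-in for
    the community-detection algorithm; each component is one topic."""
    seen, comps = set(), []
    for start in range(n):
        if start in seen:
            continue
        stack, comp = [start], []
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            comp.append(node)
            stack.extend(adj[node] - seen)
        comps.append(sorted(comp))
    return comps


def topic_words(docs, comps, top_k=3):
    """Module 4 (sketch): a class-based TF-IDF variant (assumed form):
    term frequency within a topic, weighted by inverse topic frequency."""
    topic_tf = [Counter(w for d in comp for w in docs[d].lower().split())
                for comp in comps]
    n_topics = len(comps)
    df = Counter()
    for tf in topic_tf:
        df.update(tf.keys())
    out = []
    for tf in topic_tf:
        scored = {w: c * math.log(1 + n_topics / df[w]) for w, c in tf.items()}
        out.append([w for w, _ in sorted(scored.items(), key=lambda x: -x[1])[:top_k]])
    return out
```

With four toy documents embedded as two well-separated 2-D clusters, the graph splits into two components, so two topics emerge; normalising each topic's word scores would give the word-in-topic distribution the abstract refers to.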