Topic models have been the prominent tools for automatic topic discovery from text corpora. Despite their effectiveness, topic models suffer from several limitations, including the inability to model word-ordering information in documents, the difficulty of incorporating external linguistic knowledge, and the lack of both accurate and efficient inference methods for approximating the intractable posterior. Recently, pretrained language models (PLMs) have brought astonishing performance improvements to a wide variety of tasks due to their superior representations of text. Interestingly, no standard approach has emerged for deploying PLMs for topic discovery as a better alternative to topic models. In this paper, we begin by analyzing the challenges of using PLM representations for topic discovery, and then propose a joint latent space learning and clustering framework built upon PLM embeddings. In the latent space, topic-word and document-topic distributions are jointly modeled so that the discovered topics can be interpreted by coherent and distinctive terms and simultaneously serve as meaningful summaries of the documents. Our model effectively leverages the strong representation power and superb linguistic features brought by PLMs for topic discovery, and is conceptually simpler than topic models. On two benchmark datasets in different domains, our model generates significantly more coherent and diverse topics than strong topic models, and offers better topic-wise document representations, based on both automatic and human evaluations.
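To make the high-level pipeline concrete, below is a minimal sketch of the general idea described in the abstract: embed documents with a PLM, map the embeddings into a lower-dimensional latent space, and cluster them into topics. This is not the paper's method; the proposed framework learns the latent space and the clusters jointly under a single objective, whereas this sketch substitutes off-the-shelf PCA and KMeans purely for illustration. The model name and all hyperparameters are illustrative assumptions.

```python
# Illustrative sketch only: PLM embedding -> latent space -> clustering.
# The actual framework learns these steps jointly; PCA + KMeans stand in here.
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

docs = [
    "the team won the championship game",
    "the senate passed the new budget bill",
    "the striker scored twice in the final",
]

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # assumed PLM choice
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

with torch.no_grad():
    enc = tokenizer(docs, padding=True, truncation=True, return_tensors="pt")
    out = model(**enc)
    # Mean-pool token embeddings into one vector per document,
    # masking out padding positions.
    mask = enc["attention_mask"].unsqueeze(-1).float()
    doc_emb = (out.last_hidden_state * mask).sum(1) / mask.sum(1)

# Project to a small latent space, then cluster; each cluster plays the
# role of a "topic". n_components and n_clusters are arbitrary here.
latent = PCA(n_components=2).fit_transform(doc_emb.numpy())
topic_assignments = KMeans(n_clusters=2, n_init=10).fit_predict(latent)
print(topic_assignments)  # document-to-topic assignments
```

Note what the sketch omits: the paper additionally models topic-word distributions in the same latent space, which is what lets each discovered topic be interpreted by coherent, distinctive terms rather than remaining an opaque cluster of document vectors.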