Topic models have been the prominent tools for automatic topic discovery from text corpora. Despite their effectiveness, topic models suffer from several limitations, including the inability to model word-ordering information in documents, the difficulty of incorporating external linguistic knowledge, and the lack of both accurate and efficient inference methods for approximating the intractable posterior. Recently, pretrained language models (PLMs) have brought astonishing performance improvements to a wide variety of tasks due to their superior representations of text. Interestingly, no standard approach has emerged for deploying PLMs for topic discovery as a better alternative to topic models. In this paper, we begin by analyzing the challenges of using PLM representations for topic discovery, and then propose a joint latent space learning and clustering framework built upon PLM embeddings. In the latent space, topic-word and document-topic distributions are jointly modeled so that the discovered topics can be interpreted by coherent and distinctive terms and simultaneously serve as meaningful summaries of the documents. Our model effectively leverages the strong representation power and superb linguistic features brought by PLMs for topic discovery, and is conceptually simpler than topic models. On two benchmark datasets in different domains, our model generates significantly more coherent and diverse topics than strong topic models, and offers better topic-wise document representations, based on both automatic and human evaluations.
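To make the high-level pipeline concrete, below is a minimal sketch of the general idea described in the abstract: embed documents with a PLM, map the embeddings into a lower-dimensional latent space, and cluster them into topics. This is not the paper's method; the proposed framework learns the latent space and the clusters jointly under a single objective, whereas this sketch substitutes off-the-shelf PCA and KMeans purely for illustration. The model name and all hyperparameters are illustrative assumptions.

```python
# Illustrative sketch only: PLM embedding -> latent space -> clustering.
# The actual framework learns these steps jointly; PCA + KMeans stand in here.
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

docs = [
    "the team won the championship game",
    "the senate passed the new budget bill",
    "the striker scored twice in the final",
]

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # assumed PLM choice
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

with torch.no_grad():
    enc = tokenizer(docs, padding=True, truncation=True, return_tensors="pt")
    out = model(**enc)
    # Mean-pool token embeddings into one vector per document,
    # masking out padding positions.
    mask = enc["attention_mask"].unsqueeze(-1).float()
    doc_emb = (out.last_hidden_state * mask).sum(1) / mask.sum(1)

# Project to a small latent space, then cluster; each cluster plays the
# role of a "topic". n_components and n_clusters are arbitrary here.
latent = PCA(n_components=2).fit_transform(doc_emb.numpy())
topic_assignments = KMeans(n_clusters=2, n_init=10).fit_predict(latent)
print(topic_assignments)  # document-to-topic assignments
```

Note what the sketch omits: the paper additionally models topic-word distributions in the same latent space, which is what lets each discovered topic be interpreted by coherent, distinctive terms rather than remaining an opaque cluster of document vectors.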