Text clustering and topic extraction are two important tasks in text mining. Usually, these two tasks are performed separately. For topic extraction to facilitate clustering, we can first project texts into a topic space and then perform a clustering algorithm to obtain clusters. To promote topic extraction by clustering, we can first obtain clusters with a clustering algorithm and then extract cluster-specific topics. However, this naive strategy ignores the fact that text clustering and topic extraction are strongly correlated and follow a chicken-and-egg relationship. Performing them separately fails to make them mutually benefit each other to achieve the best overall performance. In this paper, we propose an unsupervised text clustering and topic extraction framework (ClusTop) which integrates text clustering and topic extraction into a unified framework and can achieve high-quality clustering result and extract topics from each cluster simultaneously. Our framework includes four components: enhanced language model training, dimensionality reduction, clustering and topic extraction, where the enhanced language model can be viewed as a bridge between clustering and topic extraction. On one hand, it provides text embeddings with a strong cluster structure which facilitates effective text clustering; on the other hand, it pays high attention on the topic related words for topic extraction because of its self-attention architecture. Moreover, the training of enhanced language model is unsupervised. Experiments on two datasets demonstrate the effectiveness of our framework and provide benchmarks for different model combinations in this framework.
翻译:文本分组和专题提取是文字挖掘的两个重要任务。 通常, 这两项任务是分开执行的。 对于专题提取, 我们首先可以将文字分组和专题提取框架( ClusTop) 整合成一个统一的框架, 并同时实现高质量的分组结果和从每个组中提取专题。 但是, 这一天真的战略忽视了文本分组和专题提取紧密关联并遵循鸡与鸡与鸡之间的关系这一事实。 单独实施它们不能使它们相互受益, 以取得最佳的总体绩效。 在本文中, 我们建议建立一个不受监督的文本分组和专题提取框架( ClusTop), 将文本分组和专题提取纳入一个统一框架, 并同时从每个组中获取高质量的分组结果和专题提取。 我们的框架包括四个组成部分: 强化的语言模式培训、 减少维度、 组合和专题提取, 在那里可以将强化的语言模型视为组合和专题提取之间的桥梁。 一方面, 它提供与强大的分组结构相结合的文本整合, 从而便利有效的文本组合; 另一方面, 它在专题上高关注高端的集群结果, 我们的实验性语言框架 提供了两个相关的实验性框架,, 。