Text clustering, one of the most fundamental challenges in unsupervised learning, aims to group semantically similar text segments without relying on human annotations. With the rapid development of deep learning, deep clustering has achieved significant advantages over traditional clustering methods. Despite their effectiveness, most existing deep text clustering methods rely heavily on representations pre-trained in general domains, which may not be the most suitable for clustering in specific target domains. To address this issue, we propose CEIL, a novel Classification-Enhanced Iterative Learning framework for short text clustering, which improves clustering performance by introducing a classification objective that iteratively refines feature representations. In each iteration, we first use a language model to obtain the initial text representations, from which clustering results are produced by our proposed Category Disentangled Contrastive Clustering (CDCC) algorithm. After strict data filtering and aggregation, samples with clean category labels are retained and serve as supervision to update the language model with the classification objective via a prompt learning approach. Finally, the updated language model, with its improved representation ability, is used to enhance clustering in the next iteration. Extensive experiments demonstrate that the CEIL framework significantly improves clustering performance over iterations and is generally effective across various clustering algorithms. Moreover, by incorporating CEIL into CDCC, we achieve state-of-the-art clustering performance on a wide range of short text clustering benchmarks, outperforming other strong baseline methods.
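The iterative procedure described above can be pictured as a simple loop: encode, cluster, filter, update, repeat. The following is only a minimal sketch under stated assumptions, not the authors' implementation. The names `select_confident`, `classification_enhanced_loop`, `encode`, `cluster`, `fine_tune`, and `keep_ratio` are hypothetical, and the per-cluster top-scoring rule is an assumed stand-in for the filtering and aggregation step, whose details the abstract does not give.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Vector = List[float]
Encoder = Callable[[Sequence[str]], List[Vector]]
# A clusterer returns one label and one confidence score per sample.
Clusterer = Callable[[List[Vector]], Tuple[List[int], List[float]]]
# A fine-tuner takes the current encoder plus pseudo-labeled texts
# and returns an updated encoder.
FineTuner = Callable[[Encoder, List[Tuple[str, int]]], Encoder]


def select_confident(labels: Sequence[int],
                     scores: Sequence[float],
                     keep_ratio: float = 0.5) -> List[int]:
    """Keep the highest-scoring samples of each cluster.

    Assumption: a simple stand-in for the filtering step; the
    abstract does not specify the actual criterion.
    """
    by_cluster: Dict[int, List[Tuple[float, int]]] = {}
    for idx, (label, score) in enumerate(zip(labels, scores)):
        by_cluster.setdefault(label, []).append((score, idx))
    kept: List[int] = []
    for members in by_cluster.values():
        members.sort(reverse=True)  # most confident first
        n_keep = max(1, int(len(members) * keep_ratio))
        kept.extend(idx for _, idx in members[:n_keep])
    return sorted(kept)


def classification_enhanced_loop(texts: Sequence[str],
                                 encode: Encoder,
                                 cluster: Clusterer,
                                 fine_tune: FineTuner,
                                 n_iterations: int = 3,
                                 keep_ratio: float = 0.5) -> List[int]:
    """Alternate clustering and classification-based encoder updates."""
    labels: List[int] = []
    for iteration in range(n_iterations):
        embeddings = encode(texts)            # 1. text representations
        labels, scores = cluster(embeddings)  # 2. clustering results
        kept = select_confident(labels, scores, keep_ratio)  # 3. filter
        pseudo_labeled = [(texts[i], labels[i]) for i in kept]
        if iteration < n_iterations - 1:
            # 4. update the encoder with the pseudo-labels so the next
            #    iteration clusters improved representations
            encode = fine_tune(encode, pseudo_labeled)
    return labels
```

In this sketch the clustering algorithm is passed in as a callable, which mirrors the claim that the framework applies to various clustering algorithms; in practice `cluster` would wrap something like CDCC and `fine_tune` would wrap the prompt-based classification training.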