Unsupervised clustering aims at discovering the semantic categories of data according to some distance measured in the representation space. However, different categories often overlap with each other in the representation space at the beginning of the learning process, which poses a significant challenge for distance-based clustering in achieving good separation between different categories. To this end, we propose Supporting Clustering with Contrastive Learning (SCCL) -- a novel framework to leverage contrastive learning to promote better separation. We assess the performance of SCCL on short text clustering and show that SCCL significantly advances the state-of-the-art results on most benchmark datasets with 3%-11% improvement on Accuracy and 4%-15% improvement on Normalized Mutual Information. Furthermore, our quantitative analysis demonstrates the effectiveness of SCCL in leveraging the strengths of both bottom-up instance discrimination and top-down clustering to achieve better intra-cluster and inter-cluster distances when evaluated with the ground truth cluster labels.
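To make the combination of "bottom-up instance discrimination" and "top-down clustering" concrete, here is a minimal NumPy sketch of the two kinds of objectives such a framework can jointly optimize: an instance-wise contrastive (NT-Xent) loss over augmented pairs, and a DEC-style clustering loss (KL divergence between soft cluster assignments and a sharpened target distribution). This is an illustrative sketch, not the authors' implementation; all function names, shapes, and hyperparameters are assumptions.

```python
# Illustrative sketch (not the SCCL reference code) of the two losses a
# contrastive-clustering framework can combine.
import numpy as np

def nt_xent_loss(z1, z2, tau=0.5):
    """Instance-wise contrastive loss over a batch of paired augmentations.
    z1, z2: (B, D) L2-normalized embeddings of two views of the same texts."""
    B = z1.shape[0]
    z = np.concatenate([z1, z2], axis=0)        # (2B, D) stacked views
    sim = z @ z.T / tau                         # cosine similarities (unit norm)
    np.fill_diagonal(sim, -np.inf)              # exclude self-similarity
    # each example's positive is its other augmented view
    pos = np.concatenate([np.arange(B, 2 * B), np.arange(B)])
    logsumexp = np.log(np.exp(sim).sum(axis=1))
    return float(np.mean(logsumexp - sim[np.arange(2 * B), pos]))

def cluster_kl_loss(z, centroids, alpha=1.0):
    """DEC-style clustering objective: KL(target || soft assignment),
    using a Student's t kernel over distances to cluster centroids."""
    d2 = ((z[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)   # (B, K)
    q = (1.0 + d2 / alpha) ** (-(alpha + 1) / 2)
    q = q / q.sum(axis=1, keepdims=True)        # soft assignments
    f = q.sum(axis=0)                           # soft cluster frequencies
    p = q ** 2 / f
    p = p / p.sum(axis=1, keepdims=True)        # sharpened target distribution
    return float((p * np.log(p / q)).sum(axis=1).mean())
```

In such a setup, the total training objective would be a weighted sum, e.g. `cluster_kl_loss(z, centroids) + eta * nt_xent_loss(z1, z2)`, where the weight `eta` is a hypothetical hyperparameter: the contrastive term pulls augmented views of the same instance together (separating overlapping categories), while the clustering term tightens intra-cluster distances.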