Clustering has been a major research topic in the field of machine learning, one to which Deep Learning has recently been applied with significant success. However, one aspect of clustering that existing deep clustering methods do not address is that of efficiently producing multiple, diverse partitionings for a given dataset. This is particularly important, as a diverse set of base clusterings is necessary for consensus clustering, which has been found to produce better and more robust results than relying on a single clustering. To address this gap, we propose DivClust, a diversity-controlling loss that can be incorporated into existing deep clustering frameworks to produce multiple clusterings with the desired degree of diversity. We conduct experiments with multiple datasets and deep clustering frameworks and show that: a) our method effectively controls diversity across frameworks and datasets with very small additional computational cost, b) the sets of clusterings learned by DivClust include solutions that significantly outperform single-clustering baselines, and c) using an off-the-shelf consensus clustering algorithm, DivClust produces consensus clustering solutions that consistently outperform single-clustering baselines, effectively improving the performance of the base deep clustering framework.
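The abstract does not detail the form of the diversity-controlling loss. As a purely illustrative sketch (not the paper's actual formulation), one way to penalize excess similarity between clusterings is to measure an aggregate similarity between the soft cluster-assignment matrices of each pair of clustering heads and apply a hinge penalty whenever that similarity exceeds a target threshold `d`; the function names and the specific similarity measure below are assumptions for illustration only:

```python
import numpy as np

def inter_clustering_similarity(p, q):
    """Aggregate similarity between two soft clusterings p, q of shape (N, K):
    cosine similarity between cluster-assignment columns, best-matched per
    cluster and averaged. Returns a value in [0, 1]; higher = more similar.
    (Illustrative choice of measure, not necessarily the paper's.)"""
    # Column-normalize so each cluster's assignment vector has unit norm.
    pn = p / (np.linalg.norm(p, axis=0, keepdims=True) + 1e-12)
    qn = q / (np.linalg.norm(q, axis=0, keepdims=True) + 1e-12)
    sim = pn.T @ qn                        # (K, K) cluster-to-cluster similarities
    # Best match per cluster of p, averaged; invariant to cluster relabeling in q.
    return float(sim.max(axis=1).mean())

def diversity_loss(assignments, d):
    """Hinge penalty on pairs of clusterings whose similarity exceeds the
    target d; pairs already below the threshold contribute nothing, so the
    loss only pushes clusterings apart until the desired diversity is met."""
    loss, heads = 0.0, len(assignments)
    for i in range(heads):
        for j in range(i + 1, heads):
            s = inter_clustering_similarity(assignments[i], assignments[j])
            loss += max(0.0, s - d)        # penalize only excess similarity
    return loss
```

In a deep clustering framework, a term like this would be added to the base clustering objective (with assignments produced by multiple clustering heads over a shared backbone), so that gradient descent simultaneously optimizes clustering quality and keeps inter-clustering similarity at or below the chosen threshold.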