Convex clustering is a modern method that combines characteristics of hierarchical and $k$-means clustering. Although it can capture complex clustering structure hidden in data, existing convex clustering algorithms do not scale to large data sets with more than ten thousand samples. Moreover, convex clustering is known to sometimes fail to produce a hierarchical clustering structure. This undesirable phenomenon, called cluster split, makes clustering results difficult to interpret. In this paper, we propose convex clustering through majorization-minimization (CCMM), an iterative algorithm that uses cluster fusions and sparsity to enforce a complete cluster hierarchy with reduced memory usage. In the CCMM algorithm, a diagonal majorization technique makes the update in each iteration highly efficient. On a current desktop computer, the CCMM algorithm can solve a single clustering problem featuring over one million objects in seven-dimensional space within 70 seconds.
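To make the majorization-minimization idea concrete, the following is a minimal sketch of a plain MM iteration for the standard convex clustering objective from the literature, $\frac{1}{2}\|X - A\|_F^2 + \lambda \sum_{i<j} w_{ij} \|a_i - a_j\|_2$. It is not the CCMM algorithm: it uses no cluster fusions, no sparsity, and no diagonal majorization, and it stores dense $n \times n$ matrices, so it only suits small examples. The function names, the unit weights, the clamping constant `eps`, and the exact scaling of the objective are assumptions made for illustration and may differ from the paper's formulation.

```python
import numpy as np

def objective(X, A, lam, W):
    """0.5 * ||X - A||_F^2 + lam * sum_{i<j} W[i, j] * ||a_i - a_j||_2."""
    dists = np.linalg.norm(A[:, None, :] - A[None, :, :], axis=2)
    penalty = np.sum(np.triu(W * dists, k=1))
    return 0.5 * np.sum((X - A) ** 2) + lam * penalty

def mm_convex_clustering(X, lam, W, n_iter=50, eps=1e-6):
    """Plain MM sketch (hypothetical helper, not the paper's CCMM).

    Each norm ||t|| is majorized at the current distance d by
    ||t||^2 / (2 d) + d / 2, so every iteration reduces to the
    linear system (I + lam * L) A = X with a weighted Laplacian L.
    W is assumed symmetric with a zero diagonal.
    """
    n = X.shape[0]
    A = X.copy()
    for _ in range(n_iter):
        dists = np.linalg.norm(A[:, None, :] - A[None, :, :], axis=2)
        # Clamp distances so fused rows do not cause division by zero.
        C = W / np.maximum(dists, eps)
        np.fill_diagonal(C, 0.0)
        L = np.diag(C.sum(axis=1)) - C
        A = np.linalg.solve(np.eye(n) + lam * L, X)
    return A

# Toy data: two small groups of points in the plane, unit weights.
X = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
              [5.0, 5.0], [5.2, 4.9], [4.9, 5.1]])
W = np.ones((6, 6)) - np.eye(6)
A = mm_convex_clustering(X, lam=0.1, W=W)
print(objective(X, X, 0.1, W), objective(X, A, 0.1, W))
```

Rows of `A` that end up (nearly) identical correspond to objects assigned to the same cluster, and increasing `lam` pulls more rows together. The dense linear solve in each iteration is the kind of cost that makes a naive approach impractical beyond a few thousand objects.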