A Generic Distributed Clustering Framework for Massive Data

In this paper, we introduce a novel Generic distributEd clustEring frameworK (GEEK) that goes beyond $k$-means clustering to process massive amounts of data. To handle different data types, GEEK first converts data from the original feature space into a unified format of buckets. We then design a new Seeding method based on simILar bucKets (SILK) to determine the initial seeds. Unlike state-of-the-art seeding methods such as $k$-means++ and its variants, SILK does not require $k$ to be specified in advance: it automatically identifies the number of initial seeds from the closeness of the data objects shared by similar buckets. As a result, its time complexity is independent of $k$. With these well-selected initial seeds, GEEK needs only a one-pass data assignment to obtain the final clusters. We implement GEEK on a distributed CPU-GPU platform for large-scale clustering. We evaluate GEEK on five large-scale real-life datasets and show that it can handle massive data of different types and performs comparably to (or even better than) many state-of-the-art customized GPU-based methods, especially for large values of $k$.
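
The three stages named in the abstract (bucketing, seeding from similar buckets, one-pass assignment) can be illustrated with a small single-machine sketch. The abstract does not say how buckets are built or how SILK measures closeness, so everything specific here is an assumption made for illustration: the shifted-grid hashing, the overlap threshold, and the function names `to_buckets`, `pick_seeds`, and `assign` are all hypothetical, and this is not the authors' implementation.

```python
import math
from collections import defaultdict


def to_buckets(points, width=1.0, offsets=(0.0, 0.5)):
    """Put each point index into one grid-cell bucket per offset.

    Assumption: shifted grids stand in for the paper's bucket construction.
    """
    buckets = defaultdict(set)
    for i, p in enumerate(points):
        for t, off in enumerate(offsets):
            cell = tuple(math.floor((x + off * width) / width) for x in p)
            buckets[(t, cell)].add(i)
    return list(buckets.values())


def pick_seeds(points, buckets, min_size=2, min_overlap=0.5):
    """Merge buckets that share many members; return one centroid per group.

    The number of seeds falls out of the data; no k is passed in.
    Assumption: the overlap ratio is a stand-in for SILK's closeness measure.
    """
    big = [b for b in buckets if len(b) >= min_size]
    parent = list(range(len(big)))

    def find(a):
        # Union-find lookup with path halving.
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    for a in range(len(big)):
        for b in range(a + 1, len(big)):
            shared = len(big[a] & big[b])
            if shared / min(len(big[a]), len(big[b])) >= min_overlap:
                parent[find(a)] = find(b)

    groups = defaultdict(set)
    for a, members in enumerate(big):
        groups[find(a)] |= members

    dim = len(points[0])
    return [
        tuple(sum(points[i][d] for i in g) / len(g) for d in range(dim))
        for g in groups.values()
    ]


def assign(points, seeds):
    """One pass over the data: label each point with its nearest seed."""
    return [
        min(range(len(seeds)), key=lambda s: math.dist(p, seeds[s]))
        for p in points
    ]


points = [(0.1, 0.1), (0.2, 0.3), (0.3, 0.2), (5.1, 5.2), (5.3, 5.1), (5.2, 5.3)]
seeds = pick_seeds(points, to_buckets(points))
labels = assign(points, seeds)
print(len(seeds), labels)  # 2 [0, 0, 0, 1, 1, 1]
```

In this toy run the two seeds emerge from the bucket overlaps alone. The pairwise bucket comparison is quadratic in the number of buckets and `assign` expects at least one seed, so the sketch shows only the control flow, not the distributed CPU-GPU design.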
