Center-based clustering is a fundamental primitive for data analysis, and it becomes very challenging for large datasets. In this paper, we focus on the popular $k$-center variant: given a set $S$ of points from some metric space and a parameter $k<|S|$, the goal is to identify a subset of $k$ centers in $S$ that minimizes the maximum distance of any point of $S$ from its closest center. A more general formulation, introduced to deal with noisy datasets, features a further parameter $z$ and allows up to $z$ points of $S$ (outliers) to be disregarded when computing the maximum distance from the centers. We present coreset-based 2-round MapReduce algorithms for both formulations of the problem, and a 1-pass Streaming algorithm for the formulation with outliers. For any fixed $\epsilon>0$, the algorithms yield solutions whose approximation ratios are only an additive term $\epsilon$ away from those achievable by the best known polynomial-time sequential algorithms, which substantially improves upon the state of the art. Our algorithms are rather simple and adapt to the intrinsic complexity of the dataset, captured by the doubling dimension $D$ of the metric space. Specifically, our analysis shows that the algorithms become very space-efficient in the important case of small (constant) $D$. These theoretical results are complemented by a set of experiments on real-world and synthetic datasets of up to over a billion points, which show that our algorithms yield higher-quality solutions than the state of the art while featuring excellent scalability, and that they also lend themselves to sequential implementations that are much faster than existing ones.
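To make the two objectives concrete, the following is a minimal Python sketch. It is not the coreset-based MapReduce or Streaming algorithms of the paper. It only evaluates the $k$-center cost with up to $z$ outliers, and shows the classical farthest-first traversal, a well-known sequential 2-approximation for the case without outliers. The Euclidean metric, the function names `radius` and `farthest_first`, and the toy dataset are assumptions chosen for illustration.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def radius(points: Sequence[Point], centers: Sequence[Point], z: int = 0) -> float:
    """k-center cost: the largest point-to-closest-center distance,
    after discarding the z points farthest from the centers (outliers).
    Assumes 0 <= z < len(points) and a Euclidean metric."""
    dists = sorted(min(math.dist(p, c) for c in centers) for p in points)
    return dists[len(dists) - 1 - z]


def farthest_first(points: Sequence[Point], k: int) -> List[Point]:
    """Farthest-first traversal (no outliers): start from an arbitrary point,
    then repeatedly add the point farthest from the current centers."""
    centers = [points[0]]
    # d[j] = distance from points[j] to its closest chosen center so far
    d = [math.dist(p, centers[0]) for p in points]
    while len(centers) < k:
        i = max(range(len(points)), key=d.__getitem__)
        centers.append(points[i])
        d = [min(d[j], math.dist(points[j], points[i])) for j in range(len(points))]
    return centers


if __name__ == "__main__":
    # Toy dataset: two tight groups plus one far-away noisy point.
    pts = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (11.0, 0.0), (100.0, 0.0)]
    print(farthest_first(pts, 2))                      # the noisy point becomes a center
    print(radius(pts, [(0.0, 0.0), (10.0, 0.0)], 0))   # cost dominated by the noisy point
    print(radius(pts, [(0.0, 0.0), (10.0, 0.0)], 1))   # cost once one outlier is ignored
```

On this toy input, a single noisy point dominates the cost when $z=0$ and also attracts a center in the farthest-first traversal, which is the sensitivity to noise that motivates the formulation with outliers.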