Center-based clustering is a pivotal primitive for unsupervised learning and data analysis. A popular variant is the k-means problem: given a set $P$ of points from a metric space and a parameter $k<|P|$, determine a subset $S$ of $k$ centers that minimizes the sum of the squared distances from the points of $P$ to their closest centers. A more general formulation, known as k-means with $z$ outliers, was introduced to deal with noisy datasets. It features a further parameter $z$ and allows up to $z$ points of $P$ (the outliers) to be disregarded when computing this sum. We present a distributed, coreset-based, 3-round approximation algorithm for k-means with $z$ outliers in general metric spaces, using MapReduce as the computational model. Our algorithm requires sublinear local memory per reducer and yields a solution whose approximation ratio is within an additive term $O(\gamma)$ of the ratio achievable by the best known sequential (possibly bicriteria) algorithm, where $\gamma$ can be made arbitrarily small. An important feature of our algorithm is that it obliviously adapts to the intrinsic complexity of the dataset, captured by the doubling dimension $D$ of the metric space. To the best of our knowledge, no previous distributed approach attains similar quality-performance tradeoffs for general metrics.
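To make the objective concrete, the following minimal Python sketch evaluates the cost of a candidate set of centers under k-means with $z$ outliers. It only illustrates the definition above and is not the distributed algorithm of this work; the function name `kmeans_outliers_cost`, its parameters, and the toy data are assumptions introduced for illustration. For fixed centers, the best choice of outliers is the $z$ points farthest from their closest center, so the sketch simply drops the $z$ largest squared distances.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def kmeans_outliers_cost(
    points: Sequence[T],
    centers: Sequence[T],
    z: int,
    dist: Callable[[T, T], float],
) -> float:
    """Cost of `centers` on `points` when up to z points may be ignored.

    Hypothetical helper: `dist` is any metric supplied by the caller.
    """
    if not centers:
        raise ValueError("at least one center is required")
    if not 0 <= z <= len(points):
        raise ValueError("z must lie between 0 and the number of points")

    # Squared distance from each point to its closest center, ascending.
    squared = sorted(min(dist(p, c) for c in centers) ** 2 for p in points)

    # Discard the z largest terms (the outliers) and sum the rest.
    return sum(squared[: len(squared) - z])


if __name__ == "__main__":
    # Toy data on the real line with the absolute difference as metric.
    pts = [0, 1, 2, 10, 100]
    ctrs = [1, 10]

    def line_metric(a: int, b: int) -> int:
        return abs(a - b)

    print(kmeans_outliers_cost(pts, ctrs, 0, line_metric))  # plain k-means cost
    print(kmeans_outliers_cost(pts, ctrs, 1, line_metric))  # one outlier allowed
```

In the toy data, the point 100 dominates the plain k-means cost; allowing a single outlier removes its contribution, which is the effect the parameter $z$ is meant to capture on noisy datasets.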