Given $n$ points in $\ell_p^d$, we consider the problem of partitioning the points into $k$ clusters with associated centers. The cost of a clustering is the sum of the $p^{\text{th}}$ powers of the distances of the points to their cluster centers. For $p \in [1,2]$, we design sketches of size $\mathrm{poly}(\log(nd),k,1/\epsilon)$ such that the cost of the optimal clustering can be estimated to within a factor of $1+\epsilon$, despite the fact that the compressed representation does not contain enough information to recover the cluster centers or the partition into clusters. This leads to a streaming algorithm for estimating the clustering cost with space $\mathrm{poly}(\log(nd),k,1/\epsilon)$. We also obtain a distributed-memory algorithm, where the $n$ points are arbitrarily partitioned amongst $m$ machines, each of which sends information to a central party who then computes an approximation of the clustering cost. Prior to this work, no such streaming or distributed-memory algorithm was known with sublinear dependence on $d$ for $p \in [1,2)$.
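To make the objective concrete, the following minimal Python sketch evaluates the cost of a clustering for a given set of centers: each point is assigned to its nearest center and contributes $\|x - c\|_p^p = \sum_i |x_i - c_i|^p$. The function name `clustering_cost` and the toy data are illustrative assumptions, not taken from the paper. This is a direct evaluation that reads every coordinate of every point; it is not the sketching algorithm, and it does not search for the optimal centers.

```python
from typing import Sequence


def clustering_cost(
    points: Sequence[Sequence[float]],
    centers: Sequence[Sequence[float]],
    p: float,
) -> float:
    """Sum over points of the p-th power of the l_p distance to the nearest center.

    Hypothetical helper for illustration only: it evaluates the objective
    exactly for fixed centers, using the identity ||x - c||_p^p = sum_i |x_i - c_i|^p.
    """
    total = 0.0
    for x in points:
        # p-th power of the l_p distance from x to each candidate center;
        # the point is charged to whichever center is closest.
        total += min(
            sum(abs(xi - ci) ** p for xi, ci in zip(x, c)) for c in centers
        )
    return total


# Toy instance (assumed data): three points in the plane, two centers.
points = [(0.0, 0.0), (2.0, 1.0), (10.0, 10.0)]
centers = [(0.0, 0.0), (10.0, 10.0)]

cost_p1 = clustering_cost(points, centers, p=1)  # k-median-style objective
cost_p2 = clustering_cost(points, centers, p=2)  # k-means-style objective
```

Only the scalar returned by such a function, minimized over all choices of $k$ centers, is what the sketches aim to approximate; the centers and the assignment themselves need not be recoverable from the sketch.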