In projective clustering we are given a set of $n$ points in $R^d$ and wish to cluster them to a set $S$ of $k$ linear subspaces in $R^d$ according to some given distance function. An $\eps$-coreset for this problem is a weighted (scaled) subset of the input points such that, for every such possible $S$, the sum of distances from the input points to $S$ is approximated up to a factor of $(1+\eps)$. We reduce the size of existing coresets by giving the first $O(\log(m))$ approximation for the case of clustering to $m$ lines in $O(ndm)$ time, compared to the existing $\exp(m)$ solution. We then project the points on these lines and prove that for a sufficiently large $m$ we obtain a coreset for projective clustering. Our algorithm also generalizes to handle outliers. Experimental results and open-source code are also provided.
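To make the coreset definition concrete, the following is a minimal sketch of the clustering cost and of the relative error of a candidate weighted subset for one query $S$. It is not the construction described above. It assumes Euclidean distance, that each point is charged to its nearest subspace, and that every subspace is given by an orthonormal basis; the function names `dist_to_subspace`, `clustering_cost`, and `relative_error` are hypothetical.

```python
import math

def dist_to_subspace(p, basis):
    """Euclidean distance from p to the linear subspace spanned by `basis`.

    Assumption: `basis` is a list of orthonormal vectors (not checked).
    """
    sq_norm = sum(x * x for x in p)
    sq_proj = sum(sum(x * v for x, v in zip(p, vec)) ** 2 for vec in basis)
    # Clamp at zero to guard against tiny negative values from rounding.
    return math.sqrt(max(sq_norm - sq_proj, 0.0))

def clustering_cost(points, weights, subspaces):
    """Weighted sum, over points, of the distance to the nearest subspace."""
    return sum(
        w * min(dist_to_subspace(p, s) for s in subspaces)
        for p, w in zip(points, weights)
    )

def relative_error(points, core_points, core_weights, subspaces):
    """Relative gap between the full cost and the weighted-subset cost."""
    full = clustering_cost(points, [1.0] * len(points), subspaces)
    approx = clustering_cost(core_points, core_weights, subspaces)
    return abs(approx - full) / full if full > 0 else abs(approx)

# Toy data in R^2 (illustrative only).
points = [(1.0, 0.0), (0.0, 2.0), (3.0, 3.0), (1.0, 1.0)]
candidate, candidate_weights = [(1.0, 1.0)], [4.0]

# Query 1: the two coordinate axes (k = 2 lines through the origin).
axes = [[(1.0, 0.0)], [(0.0, 1.0)]]
# Query 2: the single diagonal line (k = 1).
r = 1.0 / math.sqrt(2.0)
diagonal = [[(r, r)]]

print(relative_error(points, candidate, candidate_weights, axes))
print(relative_error(points, candidate, candidate_weights, diagonal))
```

In this toy example the candidate subset matches the full cost for the coordinate axes but misses almost all of it for the diagonal line. That is why the definition requires the $(1+\eps)$ guarantee to hold for every possible $S$, not for a single query.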