We propose a simple and efficient clustering method for high-dimensional data with a large number of clusters. Our algorithm achieves high performance by evaluating the distances between each data point and only a subset of the cluster centres. It is substantially more efficient than k-means because it does not require an all-to-all comparison of data points and clusters. We show that the optimal solutions of our approximation coincide with those of the exact problem, while our approach extracts these clusters considerably more efficiently than the state of the art. We compare our approximation with exact k-means and with alternative approximation approaches on a series of standardised clustering tasks. For the evaluation, we consider the algorithmic complexity, including the number of operations to convergence, and the stability of the results.
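To make the idea of comparing each data point with only a subset of the cluster centres concrete, the following is a minimal sketch, not the method proposed here. The abstract does not state how the subset is chosen, so the sketch assumes one plausible rule: each point is compared only with its current centre and that centre's nearest neighbouring centres. The names `subset_kmeans`, `sq_dist`, `init_centres` and `n_neighbours` are hypothetical and introduced purely for illustration.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def subset_kmeans(points, init_centres, n_neighbours=1, n_iters=20):
    """Lloyd-style iteration with a restricted assignment step.

    ASSUMPTION (not taken from the abstract): the subset of centres
    evaluated for a point is its current centre plus that centre's
    `n_neighbours` closest other centres.
    """
    k = len(init_centres)
    dim = len(points[0])
    centres = [list(c) for c in init_centres]

    # One exact all-to-all pass, used only to initialise the assignments.
    assign = [min(range(k), key=lambda c: sq_dist(p, centres[c]))
              for p in points]

    for _ in range(n_iters):
        # Update step: move each centre to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:  # an empty cluster keeps its previous centre
                centres[c] = [sum(p[d] for p in members) / len(members)
                              for d in range(dim)]

        # Candidate lists: each centre followed by its nearest other centres.
        # This costs O(k^2) centre-to-centre distances per iteration.
        candidates = []
        for c in range(k):
            others = sorted((o for o in range(k) if o != c),
                            key=lambda o: sq_dist(centres[c], centres[o]))
            candidates.append([c] + others[:n_neighbours])

        # Restricted assignment step: n_neighbours + 1 distance evaluations
        # per point instead of k.
        new_assign = [min(candidates[a], key=lambda c: sq_dist(p, centres[c]))
                      for p, a in zip(points, assign)]
        if new_assign == assign:  # no point changed cluster
            break
        assign = new_assign

    return centres, assign


# Three well-separated groups of three points each.
points = [(0, 0), (0, 1), (1, 0),
          (10, 10), (10, 11), (11, 10),
          (20, 0), (20, 1), (21, 0)]
centres, assign = subset_kmeans(points, [(0, 0), (10, 10), (20, 0)])
print(assign)  # [0, 0, 0, 1, 1, 1, 2, 2, 2]
```

With `n_neighbours = k - 1` every candidate list contains all centres and the sketch reduces to standard k-means iterations; smaller values trade the exhaustive comparison for fewer distance evaluations per point.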