We present PNN-smoothing, a meta-method for initializing (seeding) the $k$-means clustering algorithm. It consists of splitting a given dataset into $J$ random subsets, clustering each of them individually, and merging the resulting clusterings with the pairwise-nearest-neighbor (PNN) method. It is a meta-method in the sense that any seeding algorithm can be used when clustering the individual subsets. If the computational complexity of that seeding algorithm is linear in the size of the data $N$ and the number of clusters $k$, PNN-smoothing is also almost linear with an appropriate choice of $J$, and it is quite competitive in practice. We show empirically, using several existing seeding methods and testing on several synthetic and real datasets, that this procedure results in systematically lower costs. In particular, our method of enhancing $k$-means++ seeding proves superior in both effectiveness and speed to the popular "greedy" $k$-means++ variant. Our implementation is publicly available at https://github.com/carlobaldassi/KMeansPNNSmoothing.jl.
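The split, cluster, and merge steps described above can be illustrated with a minimal NumPy sketch. It is not the API of KMeansPNNSmoothing.jl: the function names (`kmeanspp_seed`, `lloyd`, `pnn_merge`, `pnn_smoothing_seed`) are hypothetical, plain $k$-means++ is assumed as the inner seeding algorithm, and the fixed number of Lloyd iterations is an arbitrary choice. The PNN step here is a naive implementation that rescans all centroid pairs at every merge, so it does not reflect the almost-linear complexity stated in the abstract.

```python
import numpy as np

def kmeanspp_seed(X, k, rng):
    """Plain k-means++ seeding (D^2 sampling); assumed inner seeder."""
    n = X.shape[0]
    first = rng.integers(n)
    centers = [X[first]]
    d2 = np.sum((X - X[first]) ** 2, axis=1)
    for _ in range(1, k):
        total = d2.sum()
        # Fall back to uniform sampling if all points coincide with a center.
        idx = rng.choice(n, p=d2 / total) if total > 0 else rng.integers(n)
        centers.append(X[idx])
        d2 = np.minimum(d2, np.sum((X - X[idx]) ** 2, axis=1))
    return np.array(centers, dtype=float)

def lloyd(X, centers, iters=20):
    """A fixed number of Lloyd iterations; returns centroids and cluster sizes."""
    centers = centers.copy()
    for _ in range(iters):
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(len(centers)):
            members = X[labels == j]
            if len(members) > 0:  # keep the old centroid if the cluster is empty
                centers[j] = members.mean(axis=0)
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    sizes = np.bincount(d2.argmin(axis=1), minlength=len(centers))
    return centers, sizes

def pnn_merge(centers, sizes, k):
    """Naive PNN: repeatedly merge the pair with the smallest merge cost
    n_a * n_b / (n_a + n_b) * ||c_a - c_b||^2 until k centroids remain."""
    C = np.array([c for c, s in zip(centers, sizes) if s > 0], dtype=float)
    w = np.array([s for s in sizes if s > 0], dtype=float)
    while len(C) > k:
        diff2 = ((C[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
        cost = (w[:, None] * w[None, :]) / (w[:, None] + w[None, :]) * diff2
        np.fill_diagonal(cost, np.inf)
        a, b = np.unravel_index(np.argmin(cost), cost.shape)
        # The merged centroid is the size-weighted mean of the two centroids.
        C[a] = (w[a] * C[a] + w[b] * C[b]) / (w[a] + w[b])
        w[a] += w[b]
        C = np.delete(C, b, axis=0)
        w = np.delete(w, b)
    return C

def pnn_smoothing_seed(X, k, J, rng=None):
    """Split X into J random subsets, cluster each, merge J*k centroids to k."""
    rng = np.random.default_rng(rng)
    X = np.asarray(X, dtype=float)
    all_centers, all_sizes = [], []
    for part in np.array_split(rng.permutation(len(X)), J):
        sub = X[part]
        c, s = lloyd(sub, kmeanspp_seed(sub, k, rng))
        all_centers.extend(c)
        all_sizes.extend(s)
    return pnn_merge(all_centers, all_sizes, k)
```

Calling `pnn_smoothing_seed(X, k, J)` returns an array of up to $k$ centroids that would then be passed as the initial configuration to a full $k$-means run on the whole dataset. Swapping `kmeanspp_seed` for another seeding routine is what makes the scheme a meta-method.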