We present PNN-smoothing, a meta-method for initializing (seeding) the $k$-means clustering algorithm. It consists of splitting a given dataset into $J$ random subsets, clustering each of them individually, and merging the resulting clusterings with the pairwise-nearest-neighbor (PNN) method. It is a meta-method in the sense that any seeding algorithm can be used when clustering the individual subsets. If the computational complexity of that seeding algorithm is linear in the size of the data $N$ and in the number of clusters $k$, then PNN-smoothing is also almost linear for an appropriate choice of $J$, and it is quite competitive in practice. We show empirically, using several existing seeding methods and testing on several synthetic and real datasets, that this procedure results in systematically better costs. Our implementation is publicly available at https://github.com/carlobaldassi/KMeansPNNSmoothing.jl.
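The split, cluster, and merge steps described above can be illustrated with a minimal Python sketch. It is not the authors' implementation (which is the Julia package linked above), and it rests on several assumptions: the function names `lloyd`, `pnn_merge` and `pnn_smoothing_seed` are hypothetical, each subset is seeded by uniform random sampling and refined with a fixed number of Lloyd iterations, every subset is assumed to hold at least $k$ points, and the PNN step uses the standard size-weighted merge cost $\frac{n_a n_b}{n_a + n_b}\lVert c_a - c_b\rVert^2$ with a naive exhaustive pair search rather than an optimized one.

```python
import numpy as np

def lloyd(X, centers, iters=20):
    """Plain Lloyd iterations; returns the centroids and the cluster sizes."""
    centers = np.asarray(centers, dtype=float).copy()
    k = len(centers)
    for _ in range(iters):
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # leave an empty cluster's centroid unchanged
                centers[j] = members.mean(axis=0)
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    sizes = np.bincount(d2.argmin(axis=1), minlength=k)
    return centers, sizes

def pnn_merge(centers, sizes, k):
    """Greedily merge the cheapest pair of weighted centroids until k remain."""
    centers = [np.asarray(c, dtype=float) for c in centers]
    sizes = [float(s) for s in sizes]
    while len(centers) > k:
        best_cost, best_a, best_b = np.inf, -1, -1
        for a in range(len(centers)):
            for b in range(a + 1, len(centers)):
                w = sizes[a] * sizes[b] / (sizes[a] + sizes[b])
                cost = w * np.sum((centers[a] - centers[b]) ** 2)
                if cost < best_cost:
                    best_cost, best_a, best_b = cost, a, b
        a, b = best_a, best_b
        total = sizes[a] + sizes[b]
        # The merged centroid is the size-weighted mean of the two centroids.
        centers[a] = (sizes[a] * centers[a] + sizes[b] * centers[b]) / total
        sizes[a] = total
        del centers[b], sizes[b]
    return np.array(centers)

def pnn_smoothing_seed(X, k, J, rng):
    """Split X into J random subsets, cluster each, merge J*k centroids to k."""
    perm = rng.permutation(len(X))
    all_centers, all_sizes = [], []
    for idx in np.array_split(perm, J):
        sub = X[idx]
        # Assumed inner seeding: k distinct points drawn uniformly at random.
        init = sub[rng.choice(len(sub), size=k, replace=False)]
        c, s = lloyd(sub, init)
        keep = s > 0  # drop empty clusters so merge weights stay positive
        all_centers.append(c[keep])
        all_sizes.append(s[keep])
    return pnn_merge(np.vstack(all_centers), np.concatenate(all_sizes), k)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    blobs = [rng.normal(loc=m, scale=0.5, size=(100, 2)) for m in (0.0, 5.0, 10.0)]
    X = np.vstack(blobs)
    seeds = pnn_smoothing_seed(X, k=3, J=3, rng=rng)
    final_centers, final_sizes = lloyd(X, seeds)  # refine on the full dataset
    print(final_centers)
```

The returned centroids serve only as a seed; a final run of $k$-means on the full dataset, as in the usage lines at the bottom, would normally follow. Any other seeding routine could replace the uniform sampling inside `pnn_smoothing_seed`, which is what makes the procedure a meta-method.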