We answer the question: "Does local progress (on batches) imply global progress (on the entire dataset) for mini-batch $k$-means?" Specifically, we consider mini-batch $k$-means that terminates only when the improvement in the quality of the clustering on the sampled batch falls below some threshold. Although at first glance it appears that this algorithm might never terminate, we answer the above question in the affirmative and show that if the batch size is $\tilde{\Omega}((d/\epsilon)^2)$, the algorithm must terminate within $O(d/\epsilon)$ iterations with high probability, where $d$ is the dimension of the input and $\epsilon$ is the threshold parameter for termination. This holds regardless of how the centers are initialized. When the algorithm is initialized with the $k$-means++ scheme, it achieves an approximation ratio of $O(\log k)$ (the same as the full-batch version). Finally, we show that our results apply to the mini-batch $k$-means algorithm implemented in the scikit-learn (sklearn) Python library.
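To make the stopping rule concrete, below is a minimal Python sketch of the variant studied here: each iteration samples a batch, performs one Lloyd-style update on it, and stops once the cost improvement on that same batch drops below $\epsilon$. The function names (`mini_batch_kmeans`, `batch_cost`) and the uniform random initialization are illustrative assumptions, not the paper's exact procedure; the $O(\log k)$ guarantee additionally requires $k$-means++ seeding.

```python
import numpy as np

def batch_cost(X, centers):
    # Mean squared distance from each point to its nearest center.
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.min(axis=1).mean()

def mini_batch_kmeans(X, k, batch_size, eps, seed=None):
    """Illustrative sketch: Lloyd-style updates on sampled batches,
    stopping once the improvement on the current batch is below eps."""
    rng = np.random.default_rng(seed)
    # Uniform random initialization for brevity; the paper's O(log k)
    # approximation guarantee assumes k-means++ seeding instead.
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    while True:
        batch = X[rng.choice(len(X), size=batch_size, replace=False)]
        before = batch_cost(batch, centers)
        # Lloyd step on the batch: assign each batch point to its nearest
        # center, then move each center to the mean of its assigned points.
        d2 = ((batch[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(k):
            pts = batch[labels == j]
            if len(pts):
                centers[j] = pts.mean(axis=0)
        after = batch_cost(batch, centers)
        if before - after < eps:  # local progress below threshold: stop
            return centers
```

For the library version discussed above, sklearn's `MiniBatchKMeans` (with `init='k-means++'`) exposes a batch-based stopping rule in a similar spirit through its `max_no_improvement` and `tol` parameters.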