The $k$-means algorithm (Lloyd's algorithm) is a widely used method for clustering unlabeled data. A key bottleneck of the $k$-means algorithm is that each iteration requires time linear in the number of data points, which can be expensive in big data applications. This was improved in recent works proposing quantum and quantum-inspired classical algorithms that approximate the $k$-means algorithm locally, in time depending only logarithmically on the number of data points (along with data-dependent parameters) [q-means: A quantum algorithm for unsupervised machine learning, Kerenidis, Landman, Luongo, and Prakash, NeurIPS 2019; Do you know what $q$-means?, Cornelissen, Doriguello, Luongo, Tang, QTML 2025]. In this work, we describe a simple randomized mini-batch $k$-means algorithm and a quantum algorithm inspired by the classical algorithm. We demonstrate that the worst-case guarantees of these algorithms can significantly improve upon the bounds for algorithms in prior work. Our improvements are due to a careful use of uniform sampling, which preserves certain symmetries of the $k$-means problem that are not preserved in previous algorithms that use data norm-based sampling.
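For intuition, the following is a minimal sketch of a generic mini-batch $k$-means iteration with uniform sampling, in the spirit of the approach described above; it is not the algorithm analyzed in this work, and the batch size, learning-rate schedule, and iteration count are illustrative assumptions.

```python
# Illustrative sketch only: generic mini-batch k-means with uniformly sampled
# batches. Batch size b, step-size schedule, and stopping rule are assumptions
# for the demo, not the algorithm or guarantees from the paper.
import numpy as np

def minibatch_kmeans(X, k, b=100, n_iters=50, seed=0):
    """Cluster data X (n x d) into k clusters via mini-batch updates.

    Each iteration draws a uniform sample of b points, assigns each to the
    nearest current centroid, and nudges that centroid toward the point with
    a per-cluster step size that decays as more points are seen.
    """
    rng = np.random.default_rng(seed)
    n, _ = X.shape
    # Initialize centroids from uniformly sampled data points.
    centroids = X[rng.choice(n, size=k, replace=False)].astype(float).copy()
    counts = np.zeros(k)  # per-centroid counts for step-size decay

    for _ in range(n_iters):
        batch = X[rng.integers(0, n, size=b)]  # uniform mini-batch
        # Assign each sampled point to its nearest centroid.
        dists = np.linalg.norm(batch[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each assigned centroid toward the sampled point (step 1/count).
        for x, j in zip(batch, labels):
            counts[j] += 1
            eta = 1.0 / counts[j]
            centroids[j] = (1 - eta) * centroids[j] + eta * x
    return centroids
```

Because each iteration touches only the $b$ sampled points rather than all $n$, the per-iteration cost is independent of the dataset size, which is the regime the sublinear-time algorithms above target.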