The k-means++ algorithm is the de facto standard for finding approximate solutions to the k-means problem. A widely used implementation is provided by the scikit-learn Python package for machine learning. We propose the breathing k-means algorithm, which on average significantly outperforms scikit-learn's k-means++ with respect to both solution quality and execution speed. The initialization step of the new method is performed by k-means++, but without the usual (and costly) repetitions (ten in scikit-learn). The core of the new method is a sequence of "breathing cycles," each consisting of a "breathe in" step, in which the number of centroids is increased by m, and a "breathe out" step, in which m centroids are removed. Each step ends with a run of Lloyd's algorithm. The parameter m is decreased until it reaches zero, at which point the algorithm terminates. With the default setting (m = 5), breathing k-means dominates scikit-learn's k-means++. This is demonstrated via experiments on various data sets, including all those from the original k-means++ publication. By setting m to smaller or larger values, one can optionally produce faster or better solutions, respectively. For larger values of m, e.g., m = 20, breathing k-means is likely the new state of the art for the k-means problem.
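The breathing cycle described above can be illustrated with a minimal Python sketch built on scikit-learn's `KMeans`. The abstract does not state how centroids are chosen for insertion or removal, nor exactly when m is decreased, so the sketch fills these gaps with assumptions: new centroids are placed next to the centroids of the clusters with the largest error, the centroids with the smallest cluster error are removed, and m is decremented whenever a cycle fails to improve the error. The names `breathing_kmeans_sketch`, `lloyd`, `cluster_errors`, and `tol` are hypothetical and are not taken from the authors' implementation.

```python
import numpy as np
from sklearn.cluster import KMeans


def lloyd(X, centers):
    """Run Lloyd's algorithm from the given centers; return (centers, labels, sse)."""
    km = KMeans(n_clusters=len(centers), init=centers, n_init=1).fit(X)
    return km.cluster_centers_, km.labels_, float(km.inertia_)


def cluster_errors(X, centers, labels):
    """Summed squared distance of each cluster's points to its centroid."""
    d2 = ((X - centers[labels]) ** 2).sum(axis=1)
    return np.bincount(labels, weights=d2, minlength=len(centers))


def breathing_kmeans_sketch(X, k, m=5, tol=1e-4, seed=0):
    rng = np.random.default_rng(seed)
    # Initialization: one k-means++ run followed by Lloyd, without repetitions.
    km = KMeans(n_clusters=k, init="k-means++", n_init=1, random_state=seed).fit(X)
    best_c, best_labels, best_sse = km.cluster_centers_, km.labels_, float(km.inertia_)
    m = min(m, k)
    noise_scale = 0.01 * X.std(axis=0)

    while m > 0:
        # Breathe in: add m centroids, then run Lloyd's algorithm.
        # ASSUMPTION: new centroids are placed near those with the largest error.
        err = cluster_errors(X, best_c, best_labels)
        worst = np.argsort(err)[-m:]
        new = best_c[worst] + rng.normal(size=(m, X.shape[1])) * noise_scale
        c, labels, _ = lloyd(X, np.vstack([best_c, new]))

        # Breathe out: remove m centroids, then run Lloyd's algorithm.
        # ASSUMPTION: the centroids with the smallest cluster error are removed.
        err = cluster_errors(X, c, labels)
        keep = np.argsort(err)[m:]
        c, labels, sse = lloyd(X, c[keep])

        # ASSUMPTION: keep the result only if it improves; otherwise shrink m.
        if sse < best_sse * (1 - tol):
            best_c, best_labels, best_sse = c, labels, sse
        else:
            m -= 1
    return best_c, best_sse
```

Because a cycle's result is accepted only when it lowers the error, the returned solution in this sketch is never worse than the single k-means++ run it starts from. Larger values of `m` let each cycle make more drastic changes at the cost of more Lloyd iterations, which matches the speed versus quality trade-off mentioned in the abstract.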