We propose a new algorithm for k-means clustering in a distributed setting, where the data is distributed across many machines, and a coordinator communicates with these machines to calculate the output clustering. Our algorithm guarantees a cost approximation factor and a number of communication rounds that depend only on the computational capacity of the coordinator. Moreover, the algorithm includes a built-in stopping mechanism, which allows it to use fewer communication rounds whenever possible. We show both theoretically and empirically that in many natural cases, indeed 1-4 rounds suffice. In comparison with the popular k-means|| algorithm, our approach allows exploiting a larger coordinator capacity to obtain a smaller number of rounds. Our experiments show that the k-means cost obtained by the proposed algorithm is usually better than the cost obtained by k-means||, even when the latter is allowed a larger number of rounds. Moreover, the machine running time in our approach is considerably smaller than that of k-means||. Code for running the algorithm and experiments is available at https://github.com/selotape/distributed_k_means.
翻译:我们提议在分布式环境中对 k 手段分组采用新的算法, 数据分布在多个机器之间, 并由一名协调员与这些机器进行通信, 以计算输出组。 我们的算法保证成本近似系数和一些仅取决于协调员计算能力的通信周期。 此外, 算法包括一个内置停止机制, 允许它尽可能使用较少的通信周期。 我们从理论上和经验上都显示, 许多自然案例, 确实有1至4轮就足够了。 与流行的 k 手段算法相比, 我们的方法允许利用更大的协调员能力获得较少的回合。 我们的实验显示, 提议的算法获得的 k 手段成本通常比 k- 手段获得的成本要好, 即使后者允许进行更多回合。 此外, 我们的方法中的机器运行时间比 k- means 的比例要小得多。 运行算法和实验的代码可以在 https://github. com/selotape/ distrated_ maless 。