This paper presents a thorough evaluation of the existing methods that accelerate Lloyd's algorithm for fast k-means clustering. To do so, we analyze the pruning mechanisms of existing methods, and summarize their common pipeline into a unified evaluation framework UniK. UniK embraces a class of well-known methods and enables a fine-grained performance breakdown. Within UniK, we thoroughly evaluate the pros and cons of existing methods using multiple performance metrics on a number of datasets. Furthermore, we derive an optimized algorithm over UniK, which effectively hybridizes multiple existing methods for more aggressive pruning. To take this further, we investigate whether the most efficient method for a given clustering task can be automatically selected by machine learning, to benefit practitioners and researchers.
翻译:本文对加快劳埃德快速K- means群集算法的现有方法进行了透彻的评估。 为此,我们分析了现有方法的裁剪机制,并将其共同管道归纳成一个统一的评价框架UniK。 UniK采用一类众所周知的方法,并能够细化性能分解。在UniK中,我们用多个性能指标对若干数据集进行彻底评估现有方法的利弊。此外,我们从UniK中获得了一种优化的算法,它有效地将多种现有方法混合起来,以进行更具侵略性的裁剪。为了进一步,我们调查是否可以通过机器学习自动选择一个特定集任务的最有效方法,使实践者和研究人员受益。