This paper introduces k-splits, an improved hierarchical algorithm based on k-means that clusters data without prior knowledge of the number of clusters. K-splits starts from a small number of clusters and, where needed, incrementally splits them along the most significant axis of the data distribution to obtain better fits. Accuracy and speed are the two main advantages of the proposed method. We experiment on six synthetic benchmark datasets and two real-world datasets, MNIST and Fashion-MNIST, to show that the algorithm finds the correct number of clusters with high accuracy under different conditions. We also show that k-splits is faster than similar methods and can even be faster than standard k-means in lower dimensions. Finally, we suggest using k-splits to uncover the position of the centroids and then passing them as initial points to the k-means algorithm to fine-tune the results.
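The idea of splitting along the most significant axis and then handing the centroids to k-means can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not state the split criterion, so the rule used here (accept a split only if it cuts the cluster's sum of squared errors by a fraction `min_gain`), the cap `max_clusters`, and all function names are assumptions made for illustration. The "most significant axis" is taken to be the leading eigenvector of the cluster covariance.

```python
import numpy as np

def principal_axis(points):
    """Unit vector along the direction of largest variance (leading eigenvector)."""
    centered = points - points.mean(axis=0)
    cov = centered.T @ centered / max(len(points) - 1, 1)
    _, eigvecs = np.linalg.eigh(cov)  # eigenvalues come back in ascending order
    return eigvecs[:, -1]

def split_along_axis(points):
    """Split a cluster in two by the sign of the projection onto its principal axis."""
    proj = (points - points.mean(axis=0)) @ principal_axis(points)
    mask = proj > 0
    return points[mask], points[~mask]

def sse(points):
    """Sum of squared distances to the cluster mean."""
    return float(((points - points.mean(axis=0)) ** 2).sum())

def split_sketch(points, max_clusters=10, min_gain=0.5):
    """Recursively split clusters; min_gain is a hypothetical acceptance threshold."""
    pending, final = [points], []
    while pending:
        c = pending.pop()
        if len(c) >= 4 and len(pending) + len(final) + 2 <= max_clusters:
            a, b = split_along_axis(c)
            # Assumed rule: keep the split only if SSE drops by at least min_gain.
            if len(a) and len(b) and sse(a) + sse(b) < (1 - min_gain) * sse(c):
                pending += [a, b]
                continue
        final.append(c)
    return np.array([c.mean(axis=0) for c in final])

def lloyd(points, centroids, n_iter=20):
    """Standard k-means (Lloyd) iterations started from the given centroids."""
    centroids = np.array(centroids, dtype=float)
    for _ in range(n_iter):
        dist = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        for j in range(len(centroids)):
            members = points[labels == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    return centroids, labels

# Usage: two well-separated 2-D blobs; the number of clusters is not given.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.5, size=(200, 2)),
               rng.normal(10.0, 0.5, size=(200, 2))])
init = split_sketch(X)             # centroids found by splitting
refined, labels = lloyd(X, init)   # fine-tuning with k-means
```

The threshold is tuned for this toy case only: in one dimension, or for strongly elongated clusters, the same `min_gain` would accept splits of a single Gaussian, so a real criterion would need to be more careful.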