One application of center-based clustering algorithms such as K-Means is partitioning data points into K clusters. In some problems the feature space directly reflects the underlying structure we are trying to recover, and in others a suitable feature space can be obtained. However, while K-Means is one of the most efficient offline clustering algorithms, it cannot estimate the number of clusters on its own, which is necessary in many practical cases. Existing methods that do estimate K are computationally expensive, as they require at least one run of K-Means for each candidate K. To address this issue, we propose a K-Means initialization similar to K-Means++ that estimates K from the feature space while deterministically selecting suitable initial centroids for K-Means. We then compare the proposed method, DISCERN, with several of the most practical K estimation methods, and we compare the clustering results of K-Means when initialized randomly, with K-Means++, and with DISCERN. The results show improvements in both K estimation and final clustering performance.
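For context, the K-Means++ seeding that the proposed initialization resembles can be sketched as follows. This is a minimal NumPy sketch of standard K-Means++ only, not of DISCERN itself: the abstract states that DISCERN is deterministic and also estimates K, whereas K-Means++ samples centroids randomly for a fixed, user-supplied K.

```python
import numpy as np

def kmeans_pp_init(X, k, seed=None):
    """Standard K-Means++ seeding: each new centroid is drawn with
    probability proportional to its squared distance from the nearest
    centroid chosen so far (D^2 weighting)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    centroids = [X[rng.integers(n)]]  # first centroid: uniform at random
    for _ in range(1, k):
        # Squared distance of every point to its nearest chosen centroid.
        diffs = X[:, None, :] - np.array(centroids)[None, :, :]
        d2 = np.min((diffs ** 2).sum(axis=-1), axis=1)
        probs = d2 / d2.sum()
        centroids.append(X[rng.choice(n, p=probs)])
    return np.array(centroids)

# Toy data: three well-separated blobs in 2-D.
data_rng = np.random.default_rng(0)
X = np.vstack([data_rng.normal(c, 0.1, size=(50, 2))
               for c in ((0, 0), (5, 5), (10, 0))])
C = kmeans_pp_init(X, k=3, seed=0)
print(C.shape)  # (3, 2): one seed centroid per requested cluster
```

The D² weighting spreads the initial centroids across the data, which is what makes K-Means++ effective; a deterministic variant would replace the random draws with a fixed selection rule and would need an additional criterion to decide when to stop adding centroids, i.e., to estimate K.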