Various cluster validity indices are used to evaluate clustering results. One of their main purposes is to determine the optimal number of clusters when that number is unknown. Some indices work well for clusters with different densities, sizes, and shapes. Yet one weakness these validity indices share is that they often provide only a single optimal number of clusters. In real-world problems that number is unknown, and there may be more than one reasonable choice. We develop a new cluster validity index based on the correlation between the actual distance between a pair of data points and the distance between the centroids of the clusters to which the two points belong. The proposed index consistently yields several local peaks and thereby overcomes this weakness. Several experiments in different scenarios, including UCI real-world data sets, have been conducted to compare the proposed validity index with several well-known ones. An R package implementing the new index, called NCvalid, is available at https://github.com/nwiroonsri/NCvalid.
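The correlation idea behind the index can be illustrated with a minimal sketch. It is an assumption-laden illustration, not the exact definition used in the paper or in the NCvalid package: it computes the plain Pearson correlation between each pairwise Euclidean distance and the distance between the centroids of the clusters that the two points occupy (zero when both points share a cluster). The function name `centroid_distance_correlation` and the toy data are hypothetical.

```python
import numpy as np


def centroid_distance_correlation(X, labels):
    """Pearson correlation between pairwise point distances and the
    distances between the centroids of the points' clusters.

    Illustrative only: a simplified stand-in for a correlation-based
    validity index, not the NCvalid formula. Needs at least two clusters,
    otherwise every centroid distance is zero and the result is NaN.
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)

    # Map arbitrary labels to 0..k-1 and compute one centroid per cluster.
    uniq, inverse = np.unique(labels, return_inverse=True)
    centroids = np.array(
        [X[inverse == c].mean(axis=0) for c in range(len(uniq))]
    )

    # All unordered pairs of points (i < j).
    i, j = np.triu_indices(len(X), k=1)

    # Actual distance between the two points of each pair.
    point_dist = np.linalg.norm(X[i] - X[j], axis=1)

    # Distance between the centroids of the clusters the two points occupy.
    centroid_dist = np.linalg.norm(
        centroids[inverse[i]] - centroids[inverse[j]], axis=1
    )

    return np.corrcoef(point_dist, centroid_dist)[0, 1]


# Hypothetical toy data: two well-separated groups of two points each.
X_demo = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
good = centroid_distance_correlation(X_demo, [0, 0, 1, 1])  # matches the groups
bad = centroid_distance_correlation(X_demo, [0, 1, 0, 1])   # cuts across them
```

In this sketch a partition that matches the geometry of the data gives a higher correlation than one that cuts across it. Evaluating such a score over a range of candidate cluster numbers and inspecting its local peaks, rather than only its global maximum, is the kind of use the abstract describes.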