Evaluating clustering results is an important part of cluster analysis. Because typical unsupervised learning provides no true class labels, many internal cluster validity indices (CVIs), which use only the predicted labels and the data, have been created. Without true labels, designing an effective CVI is as difficult as creating a clustering method. Having more CVIs is crucial because no universal CVI can be used to measure all datasets, and there is no specific method for selecting a proper CVI for clusters without true labels. Applying a variety of CVIs to evaluate clustering results is therefore necessary. In this paper, we propose a novel internal CVI, the Distance-based Separability Index (DSI), based on a data separability measure. We compared the DSI with eight internal CVIs, ranging from the early Dunn index (1974) to the recent CVDD (2019), and used an external CVI as ground truth, on the clustering results of five clustering algorithms applied to 12 real and 97 synthetic datasets. Results show that the DSI is an effective, unique, and competitive CVI relative to the other compared CVIs. We also summarized the general process for evaluating CVIs and created the rank-difference metric for comparing CVIs' results.
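To make concrete what "internal" means here, the following is a minimal sketch of the Dunn index (1974), one of the baseline CVIs named above. It is not the DSI and not the authors' code. It takes only the data points and the predicted labels, and returns the smallest distance between points in different clusters divided by the largest distance between points in the same cluster. The function name `dunn_index` and the toy points are assumptions made up for illustration.

```python
from itertools import combinations
from math import dist


def dunn_index(points, labels):
    """Dunn index: smallest between-cluster distance / largest cluster diameter."""
    # Group points by their predicted cluster label.
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("need at least two clusters")

    # Largest distance between two points that share a cluster.
    max_diameter = max(
        (dist(p, q)
         for members in clusters.values()
         for p, q in combinations(members, 2)),
        default=0.0,
    )
    # Smallest distance between two points in different clusters.
    min_separation = min(
        dist(p, q)
        for a, b in combinations(clusters.values(), 2)
        for p in a
        for q in b
    )
    if max_diameter == 0.0:
        return float("inf")  # every cluster is a single point
    return min_separation / max_diameter


# Toy data (hypothetical): two tight, well-separated groups of three points.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
good_labels = [0, 0, 0, 1, 1, 1]  # matches the visible grouping
bad_labels = [0, 1, 0, 1, 0, 1]   # mixes the two groups

good_score = dunn_index(points, good_labels)
bad_score = dunn_index(points, bad_labels)
print(good_score > bad_score)
```

A higher value indicates compact, well-separated clusters, so the labeling that matches the visible grouping scores higher than the mixed one. No true labels are consulted at any point, which is what distinguishes an internal CVI from the external CVI used as ground truth in the comparison.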