Evaluating clustering results is difficult and depends heavily on the data set and on the perspective of the beholder. Many clustering quality measures exist that try to provide a general way to validate clustering results. A very popular one is the Silhouette. We discuss the efficient medoid-based variant of the Silhouette, perform a theoretical analysis of its properties, and provide two fast versions for its direct optimization. We combine ideas from the original Silhouette with the well-known PAM algorithm and its latest improvement, FasterPAM. One of the versions guarantees results equal to those of the original variant and provides a run time speedup of $O(k^2)$. In experiments on real data with 30000 samples and $k=100$, we observed a 10464$\times$ speedup compared to the original PAMMEDSIL algorithm.
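To make the medoid-based variant concrete, the following is a minimal sketch of how such a score can be evaluated. The abstract does not state the formula, so the sketch assumes the commonly cited definition, in which each point $i$ receives $s_i = 1 - d_1(i)/d_2(i)$, with $d_1(i)$ and $d_2(i)$ the distances to its nearest and second-nearest medoid, and the score is the mean of the $s_i$. The function name `medoid_silhouette`, the handling of $d_2(i) = 0$, and the toy data are illustrative assumptions, not taken from the paper.

```python
import numpy as np


def medoid_silhouette(dist_to_medoids):
    """Mean medoid Silhouette from an (n, k) matrix of point-to-medoid distances.

    Assumed definition: s_i = 1 - d1(i) / d2(i), where d1 and d2 are the
    distances from point i to its nearest and second-nearest medoid.
    """
    d = np.asarray(dist_to_medoids, dtype=float)
    if d.ndim != 2 or d.shape[1] < 2:
        raise ValueError("need an (n, k) distance matrix with k >= 2")
    # Partial sort: column 0 holds the smallest, column 1 the second smallest.
    part = np.partition(d, 1, axis=1)
    d1, d2 = part[:, 0], part[:, 1]
    # Assumed convention: if d2 == 0 the ratio is treated as 0, so s_i = 1.
    ratio = np.divide(d1, d2, out=np.zeros_like(d1), where=d2 > 0)
    return float(np.mean(1.0 - ratio))


# Toy example (hypothetical data): four points on a line, medoids at 0 and 10.
points = np.array([0.0, 1.0, 10.0, 11.0])
medoids = np.array([0.0, 10.0])
dist = np.abs(points[:, None] - medoids[None, :])
score = medoid_silhouette(dist)
```

Under this definition, evaluating the score needs only the $n \times k$ point-to-medoid distances rather than all pairwise distances. The sketch only evaluates a given set of medoids; it does not implement the optimization algorithms discussed in the paper.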