Nine popular clustering methods are applied to 42 real data sets. The aim is to give a detailed characterisation of the methods by means of several cluster validation indexes that measure individual aspects of the resulting clusters, such as small within-cluster distances, separation of clusters, and closeness to a Gaussian distribution, as introduced in Hennig (2019). Thirty of the data sets come with a "true" clustering. On these data sets, the similarity of the clusterings from the nine methods to the "true" clusterings is explored. Furthermore, a mixed effects regression relates the observable individual aspects of the clusters to the similarity with the "true" clusterings, which is unobservable in real clustering problems. The study gives new insight not only into the ability of the methods to discover "true" clusterings, but also into the properties of clusterings that can be expected from the methods, which is crucial for choosing a method in a real situation where no "true" clustering is given.
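As an illustration of what "similarity to the 'true' clustering" can mean in practice, the following minimal sketch computes the adjusted Rand index between two label vectors. The abstract does not name the similarity measure, so the choice of the adjusted Rand index here is an assumption, as are the function name `adjusted_rand_index` and the toy labels.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two partitions of the same objects.

    Assumption: this is one common similarity measure, used here only
    for illustration; the abstract does not state which measure is used.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("label vectors must have the same length")
    n = len(labels_a)
    if n < 2:
        return 1.0  # no pairs to compare; treat as perfect agreement

    # Number of object pairs placed together in both partitions.
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    # Number of object pairs placed together within each partition.
    together_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    together_b = sum(comb(c, 2) for c in Counter(labels_b).values())

    expected = together_a * together_b / comb(n, 2)
    maximum = (together_a + together_b) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both partitions have one cluster
    return (together_both - expected) / (maximum - expected)


# Hypothetical toy labels: the same grouping under different label names.
truth = [0, 0, 1, 1]
found = [1, 1, 0, 0]
ari_same = adjusted_rand_index(truth, found)
ari_crossed = adjusted_rand_index(truth, [0, 1, 0, 1])
```

The index is invariant to relabelling of clusters, so only the grouping of objects matters, not the label names. Values near 1 indicate close agreement with the reference clustering, and values near 0 indicate agreement no better than chance.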