Clustering mixed-type data, that is, observation-by-variable data that consist of both continuous and categorical variables, poses novel challenges. Foremost among these is the choice of the most appropriate clustering method for the data. This paper presents a benchmarking study comparing eight distance-based partitioning methods for mixed-type data in terms of cluster recovery performance. A series of simulations, carried out under a full factorial design, examined the effect of a variety of factors on cluster recovery. The amount of cluster overlap, the percentage of categorical variables in the data set, the number of clusters, and the number of observations had the largest effects on cluster recovery. In most of the tested scenarios, KAMILA, K-Prototypes, and sequential Factor Analysis and K-Means clustering typically performed better than the other methods. The study can serve as a useful reference for practitioners choosing the most appropriate method.
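To make the notion of a full factorial design concrete, the following minimal sketch enumerates every combination of the four factors reported as most influential. The factor names follow the abstract, but the levels shown are hypothetical placeholders chosen only for illustration; the abstract does not state the levels used in the study.

```python
from itertools import product

# Hypothetical factor levels (assumptions for illustration only);
# the abstract names these factors but not their levels.
factors = {
    "cluster_overlap": ["low", "high"],
    "pct_categorical": [20, 50, 80],
    "n_clusters": [3, 5],
    "n_observations": [100, 600],
}

# A full factorial design crosses every level of every factor,
# so each scenario is one tuple from the Cartesian product.
design = [dict(zip(factors, levels)) for levels in product(*factors.values())]

print(len(design))  # 2 * 3 * 2 * 2 = 24 scenarios
print(design[0])
```

Each dictionary in `design` describes one simulation scenario; in a benchmark of this kind, every clustering method would be run on data generated under each scenario, usually with several replications.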