Clustering mixed data presents numerous challenges that stem from the heterogeneous nature of the variables. Two major difficulties lie in initialising the algorithms and in making variables of different types comparable. This work addresses both problems. We introduce a model-based clustering method with a two-head architecture, called the Mixed data Deep Gaussian Mixture Model (MDGMM), which can be viewed as an automatic way to merge the clusterings performed separately on continuous and non-continuous data. We also design a new initialisation strategy and a data-driven method that selects "on the fly" the best specification of the model and the optimal number of clusters for a given dataset. In addition, our model provides continuous low-dimensional representations of the data, which can be a useful tool for visualising mixed datasets. Finally, we validate the performance of our approach by comparing its results with those of state-of-the-art mixed data clustering models on several commonly used datasets.