Clustering mixed data presents numerous challenges inherent to the very heterogeneous nature of the variables. A clustering algorithm should be able, despite of this heterogeneity, to extract discriminant pieces of information from the variables in order to design groups. In this work we introduce a multilayer architecture model-based clustering method called Mixed Deep Gaussian Mixture Model (MDGMM) that can be viewed as an automatic way to merge the clustering performed separately on continuous and non-continuous data. This architecture is flexible and can be adapted to mixed as well as to continuous or non-continuous data. In this sense we generalize Generalized Linear Latent Variable Models and Deep Gaussian Mixture Models. We also design a new initialisation strategy and a data driven method that selects the best specification of the model and the optimal number of clusters for a given dataset "on the fly". Besides, our model provides continuous low-dimensional representations of the data which can be a useful tool to visualize mixed datasets. Finally, we validate the performance of our approach comparing its results with state-of-the-art mixed data clustering models over several commonly used datasets.
翻译:组合的混合数据是变量非常多样化性质所固有的众多挑战。 组合算法应该能够(尽管这种差异性)从变量中提取不同的信息碎片,以便设计组。 在这项工作中,我们引入了多层结构模型基组法,称为“ 混合深层混集模型(MDMM) ” (MDMM), 这可以被视为将连续和非连续数据分别进行组集的自动合并方式。 这个结构是灵活的,可以适用于混合数据以及连续或非连续的数据。 从这个意义上讲,我们将通用线性边端变量模型和深高山混集模型和深高山混集模型进行概括。 我们还设计了新的初始化战略和数据驱动方法,为“ 苍蝇” 的给定数据集选择模型的最佳规格和最佳组群数。 此外,我们的模型为数据提供了连续的低维度表达方式,这可以成为对混合数据集进行可视化的有用工具。 最后,我们验证了我们方法的绩效,将结果与一些常用的状态混合数据组合模型进行比较。