Model-based unsupervised learning, like any learning task, stalls as soon as missing data occur. This is even more true when the missing data are informative, that is, missing not at random (MNAR). In this paper, we propose model-based clustering algorithms designed to handle very general types of missing data, including MNAR data. To do so, we introduce a mixture model for different types of data (continuous, count, categorical and mixed) to jointly model the data distribution and the MNAR mechanism, remaining vigilant about the degrees of freedom of each. Eight different MNAR models, which depend on the class membership and/or on the values of the missing variables themselves, are proposed. For a particular type of MNAR model, in which the missingness depends on the class membership, we show that the statistical inference can be carried out on the data matrix concatenated with the missing mask by considering a MAR mechanism instead; this specifically underlines the versatility of the studied MNAR models. Then, we establish sufficient conditions for identifiability of the parameters of both the data distribution and the mechanism. Regardless of the type of data and the mechanism, we propose to perform clustering using EM or stochastic EM algorithms specially developed for the purpose. Finally, we assess the numerical performance of the proposed methods on synthetic data as well as on the real medical registry TraumaBase.
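To make the "data matrix concatenated with the missing mask" concrete, the following is a minimal sketch and not the implementation used in the paper. It simulates a two-class Gaussian sample whose missingness probability depends only on the class membership, then builds the augmented matrix on which a MAR-based clustering procedure could be run. The function names, the two-class Gaussian setting and the missingness probabilities are all assumptions chosen for illustration.

```python
import numpy as np


def simulate_class_dependent_missingness(n=200, d=2, miss_prob=(0.1, 0.6), seed=0):
    """Hypothetical simulator: missingness depends only on the latent class."""
    rng = np.random.default_rng(seed)
    z = rng.integers(0, 2, size=n)                # latent class labels (0 or 1)
    means = np.array([[0.0] * d, [3.0] * d])      # one mean vector per class (assumed)
    X = rng.normal(loc=means[z], scale=1.0)       # complete data, shape (n, d)
    p = np.asarray(miss_prob)[z][:, None]         # per-row missingness probability
    mask = rng.random((n, d)) < p                 # True where the entry is missing
    X_obs = np.where(mask, np.nan, X)             # observed data with NaN for missing
    return X_obs, mask, z


def augment_with_mask(X_obs):
    """Concatenate the incomplete data matrix with its binary missing mask."""
    mask = np.isnan(X_obs)
    return np.hstack([X_obs, mask.astype(float)])


X_obs, mask, z = simulate_class_dependent_missingness()
X_aug = augment_with_mask(X_obs)                  # shape (n, 2 * d)
```

In this sketch the mask columns carry information about the class, since one class is missing far more often than the other, which is why treating them as additional observed variables can help the clustering.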