Partially recorded data arise in many applications. Such data are usually clustered by first removing incomplete cases or features with missing values, or by imputing the missing values, and then applying a clustering algorithm to the altered dataset. Here, we develop a model-based clustering methodology that uses the marginal density of the observed values, assuming a finite mixture of multivariate $t$ distributions. We compare our approximate algorithm with the corresponding full expectation-maximization (EM) approach, which accounts for the missing values in the incomplete dataset under a missing at random (MAR) assumption, as well as with case deletion and imputation methods. Because only the observed values are used, our approach is computationally more efficient than imputation or full EM. Simulation studies show that our approach recovers the true cluster partition favorably compared with case deletion and imputation under various missingness mechanisms, and is at least competitive with the full EM approach, even when the MAR assumption is violated. We demonstrate our methodology on the problem of clustering gamma-ray bursts; an implementation is available at https://github.com/emilygoren/MixtClust.
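To make the idea of working with the marginal density of the observed values concrete, the following is a minimal Python sketch, not the MixtClust implementation. It relies on the standard fact that the marginal of a multivariate $t$ distribution over a subset of coordinates is again multivariate $t$ with the same degrees of freedom, the corresponding sub-vector of the location, and the corresponding sub-matrix of the scale. The function names `observed_t_logpdf` and `responsibilities`, the use of NaN to mark missing entries, and the parameter layout are all assumptions made for illustration; the sketch covers only the computation of cluster membership probabilities for fixed parameters, not parameter estimation.

```python
import math
import numpy as np

def observed_t_logpdf(x, mu, sigma, nu):
    """Log marginal multivariate-t density of the observed entries of x.

    Hypothetical helper: NaN entries of x are treated as missing and are
    simply dropped, together with the matching entries of mu and sigma.
    """
    obs = ~np.isnan(x)
    p = int(obs.sum())
    if p == 0:
        return 0.0  # nothing observed: the case contributes no information
    diff = x[obs] - mu[obs]
    sub = sigma[np.ix_(obs, obs)]          # scale sub-matrix for observed coords
    chol = np.linalg.cholesky(sub)
    z = np.linalg.solve(chol, diff)
    maha = float(z @ z)                    # squared Mahalanobis distance
    logdet = 2.0 * float(np.sum(np.log(np.diag(chol))))
    return (math.lgamma((nu + p) / 2.0) - math.lgamma(nu / 2.0)
            - 0.5 * p * math.log(nu * math.pi) - 0.5 * logdet
            - 0.5 * (nu + p) * math.log1p(maha / nu))

def responsibilities(X, weights, mus, sigmas, nus):
    """Posterior cluster probabilities using only the observed values."""
    n, K = X.shape[0], len(weights)
    logp = np.empty((n, K))
    for i in range(n):
        for k in range(K):
            logp[i, k] = math.log(weights[k]) + observed_t_logpdf(
                X[i], mus[k], sigmas[k], nus[k])
    logp -= logp.max(axis=1, keepdims=True)  # stabilize before exponentiating
    probs = np.exp(logp)
    return probs / probs.sum(axis=1, keepdims=True)

# Illustrative data (made up): two well-separated components, one value
# missing in each row.
X = np.array([[0.1, np.nan], [np.nan, 9.8]])
mus = [np.zeros(2), np.full(2, 10.0)]
sigmas = [np.eye(2), np.eye(2)]
R = responsibilities(X, [0.5, 0.5], mus, sigmas, [5.0, 5.0])
labels = R.argmax(axis=1)
```

Each row is scored against every component using only the coordinates that were actually recorded, so no imputation step and no expectation over the missing coordinates is needed in this calculation.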