Partially recorded data are encountered in many applications. In practice, such data sets are usually clustered by removing incomplete cases or features with missing values, or by imputing the missing values, and then applying a clustering algorithm to the resulting altered data set. Here, we develop clustering methodology through a model-based approach that uses the marginal density of the observed values, assuming a finite mixture model of multivariate $t$ distributions. We compare our algorithm to the corresponding full expectation-maximization (EM) approach, which accounts for the missing values in the incomplete data set under a missing at random (MAR) assumption, as well as to case deletion and imputation methods. Because only the observed values are used, our approach is computationally more efficient than imputation or full EM. Simulation studies demonstrate that our approach compares favorably with case deletion and imputation in recovering the true cluster partition under various missingness mechanisms, and that it is more robust to extreme MAR violations than the full EM approach, which we surmise is because it does not use the observed values to inform those that are missing. Our methodology is demonstrated on a problem of clustering gamma-ray bursts and is implemented at \url{https://github.com/emilygoren/MixtClust}.
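As a rough illustration of what working with the marginal density of the observed values can look like, the following Python sketch evaluates the log-density of one partially observed record under a mixture of multivariate $t$ distributions. It relies on the standard fact that the marginal of a multivariate $t$ over a subset of coordinates is again multivariate $t$ with the corresponding sub-vector of the location, sub-matrix of the scale, and the same degrees of freedom. The function names (`observed_t_logpdf`, `mixture_loglik_row`) and the use of `None` to mark missing entries are assumptions made for this example; they are not the interface of the authors' implementation, and the sketch covers only density evaluation, not the fitting algorithm.

```python
import math


def cholesky(a):
    """Lower-triangular Cholesky factor of a symmetric positive-definite matrix."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = math.sqrt(s) if i == j else s / low[j][j]
    return low


def observed_t_logpdf(x, mu, sigma, nu):
    """Log marginal t density of the observed entries of x (None marks missing)."""
    obs = [j for j, v in enumerate(x) if v is not None]
    q = len(obs)
    if q == 0:
        return 0.0  # nothing observed: the marginal integrates to 1
    diff = [x[j] - mu[j] for j in obs]
    sub = [[sigma[i][j] for j in obs] for i in obs]  # observed block of the scale matrix
    low = cholesky(sub)
    # Forward substitution solves low @ z = diff, so sum(z^2) is the Mahalanobis distance.
    z = []
    for i in range(q):
        z.append((diff[i] - sum(low[i][k] * z[k] for k in range(i))) / low[i][i])
    maha = sum(v * v for v in z)
    logdet = 2.0 * sum(math.log(low[i][i]) for i in range(q))
    return (
        math.lgamma((nu + q) / 2.0)
        - math.lgamma(nu / 2.0)
        - 0.5 * q * math.log(nu * math.pi)
        - 0.5 * logdet
        - 0.5 * (nu + q) * math.log1p(maha / nu)
    )


def mixture_loglik_row(x, weights, mus, sigmas, nus):
    """Log mixture density of one partially observed record (log-sum-exp over components)."""
    terms = [
        math.log(w) + observed_t_logpdf(x, m, s, v)
        for w, m, s, v in zip(weights, mus, sigmas, nus)
    ]
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))
```

Because each record contributes only through its observed block, no missing value is ever filled in or integrated over explicitly, which is consistent with the computational advantage over imputation and full EM described above.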