Partially recorded data arise in many applications. In practice, such datasets are usually clustered by removing incomplete cases or features with missing values, or by imputing the missing values, and then applying a clustering algorithm to the resulting altered dataset. Here, we develop a model-based clustering methodology that uses the marginal density of the observed values under a finite mixture model of multivariate $t$ distributions. We compare our algorithm to case deletion, to imputation, and to the corresponding full expectation-maximization (EM) approach, which accounts for the missing values in the incomplete dataset under a missing at random (MAR) assumption. Since only the observed values are utilized, our approach is computationally more efficient than imputation or full EM. Simulation studies demonstrate that our approach recovers the true cluster partition more favorably than case deletion and imputation under various missingness mechanisms, and is more robust to extreme MAR violations than the full EM approach because it does not use the observed values to inform those that are missing. We demonstrate our methodology on a problem of clustering gamma-ray bursts and implement it in the MixtClust R package (https://github.com/emilygoren/MixtClust).
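To make the role of the marginal density concrete, the following Python sketch evaluates the observed-data log-likelihood of a multivariate $t$ mixture. It relies on the standard fact that any sub-vector of a multivariate $t$ random vector is again multivariate $t$, with the matching sub-vector of the location, the matching sub-matrix of the scale, and the same degrees of freedom. The function names `t_logpdf_observed` and `mixture_loglik_observed`, and the convention of encoding missing entries as NaN, are assumptions made for illustration; they are not the MixtClust API, and the sketch covers only the likelihood evaluation, not the fitting algorithm.

```python
import math
import numpy as np


def t_logpdf_observed(x, mu, sigma, df):
    """Log-density of the observed coordinates of x under a multivariate t.

    Missing entries of x are encoded as NaN (an illustrative convention).
    Only the observed sub-vector of mu and sub-matrix of sigma are used.
    """
    obs = ~np.isnan(x)
    p = int(obs.sum())
    if p == 0:
        return 0.0  # nothing observed: the marginal density is identically 1
    diff = x[obs] - mu[obs]
    scale = sigma[np.ix_(obs, obs)]
    chol = np.linalg.cholesky(scale)
    z = np.linalg.solve(chol, diff)
    maha = float(z @ z)  # squared Mahalanobis distance
    logdet = 2.0 * float(np.sum(np.log(np.diag(chol))))
    return (
        math.lgamma((df + p) / 2.0)
        - math.lgamma(df / 2.0)
        - 0.5 * p * math.log(df * math.pi)
        - 0.5 * logdet
        - 0.5 * (df + p) * math.log1p(maha / df)
    )


def mixture_loglik_observed(X, weights, mus, sigmas, dfs):
    """Sum over rows of the log mixture density of each row's observed values."""
    total = 0.0
    for x in X:
        comps = [
            math.log(w) + t_logpdf_observed(x, m, s, v)
            for w, m, s, v in zip(weights, mus, sigmas, dfs)
        ]
        top = max(comps)  # log-sum-exp for numerical stability
        total += top + math.log(sum(math.exp(c - top) for c in comps))
    return total
```

Each row contributes through whatever coordinates it has, so no case is deleted and no value is imputed; a row with a single observed coordinate is scored by the corresponding univariate $t$ marginal of each component.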