This paper tackles the problem of missing-data imputation for noisy and non-Gaussian data. A classical imputation method, the Expectation-Maximization (EM) algorithm for Gaussian mixture models, has shown interesting properties when compared to other popular approaches such as those based on k-nearest neighbors or on multiple imputation by chained equations. However, Gaussian mixture models are known not to be robust to heterogeneous data, which can lead to poor estimation performance when the data are contaminated by outliers or drawn from non-Gaussian distributions. To overcome this issue, a new EM algorithm is investigated for mixtures of elliptical distributions, with the appealing property of handling potential missing data. The complete-data likelihood associated with mixtures of elliptical distributions is well adapted to the EM framework thanks to its conditional distribution, which is shown to be a Student's t-distribution. Experimental results on synthetic data demonstrate that the proposed algorithm is robust to outliers and can be used with non-Gaussian data. Furthermore, experiments conducted on real-world datasets show that this algorithm is very competitive when compared to other classical imputation methods.
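For context, the sketch below illustrates the classical baseline referenced above, namely imputation with a Gaussian mixture model fitted by EM, where missing entries are replaced by their conditional expectation under the fitted components. It is not the paper's proposed robust elliptical-mixture algorithm; the function name `gmm_impute`, the iterative refitting loop, and the hyperparameters are illustrative assumptions, and each row is assumed to contain at least one observed value.

```python
# Illustrative sketch of GMM-based imputation (the classical baseline, not the
# paper's robust elliptical-mixture EM): alternate between fitting a Gaussian
# mixture on the current completed data and replacing missing entries with
# their responsibility-weighted conditional expectations.
import numpy as np
from sklearn.mixture import GaussianMixture

def gmm_impute(X, n_components=3, n_iter=10, seed=0):
    """X: (n, d) array with np.nan marking missing entries (illustrative)."""
    X = np.asarray(X, dtype=float)
    mask = np.isnan(X)
    X_imp = X.copy()
    # Initialize missing entries with column means.
    col_means = np.nanmean(X, axis=0)
    X_imp[mask] = np.take(col_means, np.where(mask)[1])

    for _ in range(n_iter):
        gmm = GaussianMixture(n_components=n_components,
                              covariance_type="full",
                              random_state=seed).fit(X_imp)
        resp = gmm.predict_proba(X_imp)  # responsibilities, shape (n, K)
        for i in np.where(mask.any(axis=1))[0]:
            m, o = mask[i], ~mask[i]     # assumes o has at least one entry
            x_new = np.zeros(m.sum())
            for k in range(n_components):
                mu, S = gmm.means_[k], gmm.covariances_[k]
                # Conditional mean of the missing block given the observed one:
                # mu_m + S_mo S_oo^{-1} (x_o - mu_o)
                cond = mu[m] + S[np.ix_(m, o)] @ np.linalg.solve(
                    S[np.ix_(o, o)], X_imp[i, o] - mu[o])
                x_new += resp[i, k] * cond
            X_imp[i, m] = x_new
    return X_imp
```

The paper's contribution replaces the Gaussian components with elliptical distributions, whose Student conditional distribution makes the same EM-style updates robust to outliers and heavy tails.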