Missing values are common in many real-life datasets. However, most current machine learning methods cannot handle them, which means that missing values must be imputed beforehand. Gaussian Processes (GPs) are non-parametric models with accurate uncertainty estimates that, combined with sparse approximations and stochastic variational inference, scale to large datasets. Sparse GPs can be used to compute a predictive distribution for missing data. Here, we present a hierarchical composition of sparse GPs that predicts the missing values in each dimension using all the variables from the other dimensions. We call this approach the missing GP (MGP). MGP can be trained simultaneously to impute all the observed missing values. Specifically, it outputs a predictive distribution for each missing value, which is then used in the imputation of other missing values. We evaluate MGP on one private clinical dataset and four UCI datasets with different percentages of missing values, and compare its performance with that of other state-of-the-art methods for imputing missing values, including variants based on sparse GPs and deep GPs. The results show that MGP performs significantly better.
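To make the per-dimension idea concrete, the following is a minimal sketch of GP-based imputation in which each dimension is predicted from all the others. It is not the MGP model itself: it uses scikit-learn's exact GaussianProcessRegressor instead of sparse variational GPs, and a simple round-robin update in place of simultaneous training; the function name, kernel choice, and number of iterations are illustrative assumptions.

```python
# Sketch: impute each column of X from the remaining columns with a GP,
# iterating so that imputed values feed into the other columns' models.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def gp_impute(X, n_iters=5):
    """Fill NaN entries of X, one dimension at a time (illustrative only)."""
    X = X.copy()
    missing = np.isnan(X)
    # Start from column-mean imputation so every regressor sees complete inputs.
    col_means = np.nanmean(X, axis=0)
    X[missing] = np.take(col_means, np.where(missing)[1])

    for _ in range(n_iters):
        for d in range(X.shape[1]):
            if not missing[:, d].any():
                continue
            other = np.delete(np.arange(X.shape[1]), d)
            obs = ~missing[:, d]
            gp = GaussianProcessRegressor(kernel=RBF(), normalize_y=True)
            gp.fit(X[obs][:, other], X[obs, d])
            # The predictive mean fills the missing entries; the predictive
            # std is the uncertainty that MGP would propagate to the
            # imputation of other dimensions.
            mean, std = gp.predict(X[missing[:, d]][:, other], return_std=True)
            X[missing[:, d], d] = mean
    return X
```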