Datasets with missing values are very common in real-world applications. GAIN, a recently proposed deep generative model for missing data imputation, has been shown to outperform many state-of-the-art methods. However, GAIN uses only a reconstruction loss in the generator to minimize the imputation error on the non-missing part, ignoring the potential category information that can reflect the relationships between samples. In this paper, we propose a novel unsupervised missing data imputation method named PC-GAIN, which utilizes potential category information to further enhance imputation power. Specifically, we first propose a pre-training procedure to learn the potential category information contained in a subset of low-missing-rate data. An auxiliary classifier is then determined from the synthetic pseudo-labels. Further, this classifier is incorporated into the generative adversarial framework to help the generator yield higher-quality imputation results. The proposed method can significantly improve the imputation quality of GAIN. Experimental results on various benchmark datasets show that our method is also superior to other baseline models.
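The pseudo-labeling step described above can be illustrated with a minimal NumPy sketch. It is not the PC-GAIN implementation. Several choices here are assumptions that the abstract does not specify: missing entries are stored as NaN with a binary mask (1 = observed), column-mean imputation stands in for the pre-trained generator, and a small k-means routine with farthest-point initialization stands in for whatever procedure synthesizes the pseudo-labels. All function names are hypothetical. Training the auxiliary classifier and the adversarial part are omitted.

```python
import numpy as np

def masked_reconstruction_loss(x, x_hat, mask):
    """Mean squared error over observed entries only (mask == 1)."""
    diff = np.where(mask == 1, x - x_hat, 0.0)  # ignore missing (NaN) entries
    return float((diff ** 2).sum() / max(mask.sum(), 1))

def select_low_missing_rows(mask, fraction):
    """Indices of the `fraction` of rows with the lowest missing rate."""
    missing_rate = 1.0 - mask.mean(axis=1)
    n_keep = max(1, int(round(fraction * mask.shape[0])))
    return np.argsort(missing_rate, kind="stable")[:n_keep]

def mean_impute(x, mask):
    """Assumed stand-in for the pre-trained generator: column-mean filling."""
    observed = np.where(mask == 1, x, 0.0)
    col_mean = observed.sum(axis=0) / np.maximum(mask.sum(axis=0), 1)
    return np.where(mask == 1, x, col_mean)

def kmeans_pseudo_labels(x, k, n_iter=50):
    """Assumed pseudo-label source: k-means with farthest-point init."""
    centers = [x[0]]
    for _ in range(1, k):
        # Distance of every row to its nearest already-chosen center.
        d = np.min([((x - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(x[d.argmax()])
    centers = np.array(centers, dtype=float)
    for _ in range(n_iter):
        dists = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):  # keep the old center if a cluster empties
                centers[j] = x[labels == j].mean(axis=0)
    return labels

def pseudo_label_subset(x, mask, fraction=0.5, k=2):
    """Pick low-missing-rate rows, impute them, and cluster into pseudo-labels."""
    idx = select_low_missing_rows(mask, fraction)
    imputed = mean_impute(x[idx], mask[idx])
    return idx, kmeans_pseudo_labels(imputed, k)
```

The returned indices and labels would then serve as training data for an auxiliary classifier. `masked_reconstruction_loss` shows the kind of observed-entries-only error that GAIN's generator minimizes.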