Missing data is a common problem in practice. Many imputation methods have been developed to fill in the missing entries, but not all of them scale to high-dimensional data, especially multiple imputation techniques. Meanwhile, modern data sets are increasingly high-dimensional. In this work, we therefore propose Principal Component Analysis Imputation (PCAI), a simple but versatile framework based on Principal Component Analysis (PCA) that speeds up the imputation process and alleviates the memory issues of many available imputation techniques without sacrificing imputation quality in terms of mean squared error (MSE). The framework can be used even when some or all of the missing features are categorical, or when the number of missing features is large. We also introduce PCA Imputation - Classification (PIC), an adaptation of PCAI to classification problems. Experiments on various scenarios show that PCAI and PIC work with a range of imputation algorithms, including state-of-the-art ones, and improve imputation speed significantly while achieving competitive MSE or classification accuracy compared to direct imputation (i.e., imputing directly on the missing data).
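The abstract does not spell out the procedure, so the following is only a minimal sketch of the general idea. It assumes that the features without missing values are first compressed with PCA, and that an imputer is then run on the compressed features together with the incomplete ones. The names `pca_reduce`, `regression_impute`, `pcai_sketch`, and `var_kept` are hypothetical, the regression imputer is a simple stand-in rather than one of the algorithms evaluated in the paper, and the sketch handles numeric features only.

```python
import numpy as np


def pca_reduce(F, var_kept=0.95):
    """Project fully observed features onto the leading principal components."""
    centered = F - F.mean(axis=0)
    # Rows of Vt are principal directions; s**2 is proportional to variance.
    _, s, Vt = np.linalg.svd(centered, full_matrices=False)
    explained = np.cumsum(s**2) / np.sum(s**2)
    # Smallest number of components whose cumulative variance reaches var_kept.
    k = min(int(np.searchsorted(explained, var_kept)) + 1, len(s))
    return centered @ Vt[:k].T


def regression_impute(Z):
    """Stand-in imputer (assumption): least-squares fit of each incomplete
    column on the complete columns, then prediction of the missing entries."""
    Z = Z.copy()
    nan_mask = np.isnan(Z)
    complete = ~nan_mask.any(axis=0)
    # Design matrix: intercept plus all columns that have no missing values.
    A = np.hstack([np.ones((Z.shape[0], 1)), Z[:, complete]])
    for j in np.where(~complete)[0]:
        obs = ~nan_mask[:, j]
        coef, *_ = np.linalg.lstsq(A[obs], Z[obs, j], rcond=None)
        Z[~obs, j] = A[~obs] @ coef
    return Z


def pcai_sketch(X, imputer=regression_impute, var_kept=0.95):
    """Assumed PCAI-style pipeline: reduce complete features, then impute."""
    has_nan = np.isnan(X).any(axis=0)
    R = pca_reduce(X[:, ~has_nan], var_kept)      # compressed complete features
    reduced = np.hstack([R, X[:, has_nan]])       # smaller input for the imputer
    imputed = imputer(reduced)
    out = X.copy()
    out[:, has_nan] = imputed[:, R.shape[1]:]     # write imputed columns back
    return out


# Synthetic demo data (illustrative only, not from the paper).
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 3))
F = latent @ rng.normal(size=(3, 20)) + 0.1 * rng.normal(size=(200, 20))
M_true = latent @ rng.normal(size=(3, 2)) + 0.1 * rng.normal(size=(200, 2))
M = M_true.copy()
holes = rng.random(M.shape) < 0.2                 # about 20% missing entries
M[holes] = np.nan
X = np.hstack([F, M])

X_imputed = pcai_sketch(X)
mse = np.mean((X_imputed[:, 20:][holes] - M_true[holes]) ** 2)
```

Any imputer that takes an array with `NaN` entries and returns a filled array of the same shape could be passed as `imputer`; the potential saving comes from that imputer seeing only the reduced matrix instead of all original columns.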