Principal Component Analysis Based Frameworks for Efficient Missing Data Imputation Algorithms

Missing data is a common problem in practice. Many imputation methods have been developed to fill in the missing entries. However, not all of them scale to high-dimensional data, especially the multiple imputation techniques, while data today are increasingly high-dimensional. Therefore, in this work, we propose Principal Component Analysis Imputation (PCAI), a simple but versatile framework based on Principal Component Analysis (PCA) that speeds up the imputation process and alleviates the memory issues of many available imputation techniques without sacrificing imputation quality in terms of mean squared error (MSE). In addition, the framework can be used even when some or all of the missing features are categorical, or when the number of missing features is large. Next, we introduce PCA Imputation - Classification (PIC), an application of PCAI to classification problems with some adjustments. We validate our approach with experiments on various scenarios, which show that PCAI and PIC work with various imputation algorithms, including state-of-the-art ones, and improve imputation speed significantly while achieving mean squared error or classification accuracy competitive with direct imputation (i.e., imputing directly on the missing data).
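
The abstract does not spell out the procedure, so the following is only a minimal sketch of one plausible reading of the idea. It assumes that PCA is applied to the fully observed features, and that an imputer is then run on the reduced features together with the features that contain missing values. The function names (`pca_reduce`, `regression_impute`, `pcai`), the 95% variance threshold, and the least-squares stand-in imputer are all illustrative assumptions, not the authors' implementation; any imputation algorithm could be passed in place of the stand-in.

```python
import numpy as np


def pca_reduce(F, var_ratio=0.95):
    """Project fully observed features F onto the leading principal
    components that retain at least `var_ratio` of the variance."""
    Fc = F - F.mean(axis=0)                      # center each column
    _, s, Vt = np.linalg.svd(Fc, full_matrices=False)
    explained = np.cumsum(s**2) / np.sum(s**2)   # cumulative variance ratio
    k = min(int(np.searchsorted(explained, var_ratio)) + 1, len(s))
    return Fc @ Vt[:k].T


def regression_impute(X):
    """Stand-in imputer (an assumption, not from the source): fill NaNs in
    each incomplete column by least-squares regression on the complete columns."""
    X = X.copy()
    complete = ~np.isnan(X).any(axis=0)
    A = np.column_stack([np.ones(len(X)), X[:, complete]])  # intercept + predictors
    for j in np.where(~complete)[0]:
        miss = np.isnan(X[:, j])
        coef, *_ = np.linalg.lstsq(A[~miss], X[~miss, j], rcond=None)
        X[miss, j] = A[miss] @ coef
    return X


def pcai(X, imputer=regression_impute, var_ratio=0.95):
    """Hypothetical PCAI-style wrapper: reduce the fully observed block with
    PCA, impute on [reduced block, missing block], and write the result back."""
    has_nan = np.isnan(X).any(axis=0)
    F, M = X[:, ~has_nan], X[:, has_nan]
    R = pca_reduce(F, var_ratio)
    imputed = imputer(np.hstack([R, M]))
    out = X.copy()
    out[:, has_nan] = imputed[:, R.shape[1]:]    # keep only the imputed missing block
    return out


# Synthetic demo: 12 correlated features, missing values in the last two.
rng = np.random.default_rng(0)
Z = rng.normal(size=(200, 3))
X_full = Z @ rng.normal(size=(3, 12)) + 0.01 * rng.normal(size=(200, 12))
X = X_full.copy()
mask = rng.random((200, 2)) < 0.2
X[:, -2:][mask] = np.nan

X_hat = pcai(X)
```

In this sketch the imputer sees only the retained principal components plus the two incomplete columns rather than all 12 features, which is the kind of dimensionality reduction that the abstract credits for the speed and memory gains.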
