Principal Component Analysis based frameworks for efficient missing data imputation algorithms

Missing data is a common problem in practice. Many imputation methods have been developed to fill in the missing entries. However, not all of them scale to high-dimensional data, especially multiple imputation techniques. Meanwhile, modern data tends to be high-dimensional. Therefore, in this work, we propose Principal Component Analysis Imputation (PCAI), a simple but versatile framework based on Principal Component Analysis (PCA) that speeds up the imputation process and alleviates the memory issues of many available imputation techniques without sacrificing imputation quality in terms of mean squared error (MSE). In addition, the framework can be used even when some or all of the missing features are categorical, or when the number of missing features is large. Next, we introduce PCA Imputation - Classification (PIC), an application of PCAI to classification problems with some adjustments. We validate our approach with experiments on various scenarios, which show that PCAI and PIC work with various imputation algorithms, including state-of-the-art ones, and improve imputation speed significantly while achieving competitive mean squared error or classification accuracy compared to direct imputation (i.e., imputing directly on the missing data).
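The abstract does not spell out the procedure, so the following is only a rough sketch of one way the idea could look, under stated assumptions: the fully observed columns are compressed with PCA, and the columns with missing values are then imputed from the compressed representation rather than from all original columns. The function names (`pca_reduce`, `regression_impute`, `pcai_sketch`), the synthetic data, and the use of plain least-squares regression as the imputer are all illustrative choices, not the authors' implementation; any imputation method could take the place of the regression step.

```python
import numpy as np

def pca_reduce(F, n_components):
    """Project the fully observed block F onto its top principal components."""
    Fc = F - F.mean(axis=0)
    _, _, Vt = np.linalg.svd(Fc, full_matrices=False)
    return Fc @ Vt[:n_components].T

def regression_impute(R, M):
    """Stand-in imputer (assumption): fill NaNs in each column of M by
    least-squares regression on the reduced features R."""
    M = M.copy()
    A = np.column_stack([np.ones(len(R)), R])  # add an intercept column
    for j in range(M.shape[1]):
        miss = np.isnan(M[:, j])
        if not miss.any() or miss.all():
            continue  # nothing to fill, or nothing to fit on (left as NaN)
        coef, *_ = np.linalg.lstsq(A[~miss], M[~miss, j], rcond=None)
        M[miss, j] = A[miss] @ coef
    return M

def pcai_sketch(X, n_components):
    """Reduce the fully observed columns, then impute the remaining ones."""
    X = np.asarray(X, dtype=float)
    has_nan = np.isnan(X).any(axis=0)
    R = pca_reduce(X[:, ~has_nan], n_components)
    out = X.copy()
    out[:, has_nan] = regression_impute(R, X[:, has_nan])
    return out

# Synthetic data: 20 fully observed columns and 2 columns with ~20% missing,
# all driven by 3 latent factors.
rng = np.random.default_rng(0)
n = 500
Z = rng.normal(size=(n, 3))
F = Z @ rng.normal(size=(3, 20)) + 0.1 * rng.normal(size=(n, 20))
M_true = Z @ rng.normal(size=(3, 2)) + 0.1 * rng.normal(size=(n, 2))
mask = rng.random(M_true.shape) < 0.2
M = M_true.copy()
M[mask] = np.nan
X = np.hstack([F, M])

X_imp = pcai_sketch(X, n_components=3)
imputed = X_imp[:, -2:]
mse = np.mean((imputed[mask] - M_true[mask]) ** 2)

# Baseline for comparison: fill each column with its observed mean.
col_means = np.nanmean(M, axis=0)
mse_mean = np.mean((np.broadcast_to(col_means, M.shape)[mask] - M_true[mask]) ** 2)
print(f"MSE (PCA + regression): {mse:.4f}, MSE (column mean): {mse_mean:.4f}")
```

In this sketch the imputer sees only 3 reduced columns instead of 20, which is the kind of reduction in problem size that would plausibly lower running time and memory use for more expensive imputers.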



PCA


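In statistics, principal component analysis (PCA) is a method that projects data from a higher-dimensional space onto a lower-dimensional space while retaining as much variance as possible along each new dimension. Given a set of points in two, three, or higher-dimensional space, a "best-fit" line can be defined as the line that minimizes the mean squared distance from the points to the line. The next best-fit line can be chosen in the same way from the directions perpendicular to the first. Repeating this process yields an orthogonal basis in which the individual dimensions of the data are uncorrelated. These basis vectors are called the principal components.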
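As a small illustration of this description, the sketch below computes principal components with a singular value decomposition of the centered data, which is one standard way to obtain them. The helper name `pca` and the synthetic data are assumptions made for the example.

```python
import numpy as np

def pca(X, n_components):
    """Return (components, scores, explained_variance) via SVD of centered data."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                      # center each column
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]               # rows are orthonormal directions
    scores = Xc @ components.T                   # data expressed in the new basis
    explained_variance = S[:n_components] ** 2 / (X.shape[0] - 1)
    return components, scores, explained_variance

# Three correlated columns: most of the variance lies along a single direction.
rng = np.random.default_rng(0)
t = rng.normal(size=200)
X = np.column_stack([
    t,
    2 * t + 0.1 * rng.normal(size=200),
    0.1 * rng.normal(size=200),
])

components, scores, var = pca(X, n_components=2)
score_cov = np.cov(scores, rowvar=False)
print("variance along each component:", var)
print("covariance between the two score columns:", score_cov[0, 1])
```

The rows of `components` are mutually orthogonal unit vectors, and the off-diagonal entry of the score covariance is zero up to floating-point error, matching the statement that the new dimensions are uncorrelated.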
