Missing data are a concern in many real-world data sets, and imputation methods are often needed to estimate the missing values. However, data sets with excessive missingness and high dimensionality challenge most approaches to imputation. Here we show that appropriate feature selection can be an effective preprocessing step for imputation, allowing for more accurate imputation and more accurate subsequent model predictions. The key property of this preprocessing is that it incorporates uncertainty: by accounting for the uncertainty due to missingness when selecting features, we can reduce the degree of missingness while also limiting the number of uninformative features used to build predictive models. We introduce a method for uncertainty-aware feature selection (UAFS), provide a theoretical motivation for it, and test it on both real and synthetic problems, demonstrating that it improves imputation accuracy across a variety of data sets and levels of missingness. The improved imputation due to UAFS also yields improved prediction accuracy when supervised learning is performed on the imputed data sets. Our UAFS method is general and can be coupled with a variety of imputation methods.
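To make the idea of accounting for missingness during feature selection concrete, the following is a minimal illustrative sketch and not the UAFS procedure itself, whose details are not given in this abstract. It assumes a fully observed numeric target `y`, marks missing entries in `X` as NaN, and scores each feature by a lower confidence bound on its absolute Pearson correlation with the target, using the Fisher z-transform. The function names `lower_bound_relevance` and `select_features` and the parameter `z_crit` are hypothetical. Because the confidence interval widens as fewer values are observed, a heavily missing feature needs stronger evidence of relevance to be kept.

```python
import numpy as np


def lower_bound_relevance(X, y, z_crit=1.96):
    """Score each feature by a lower confidence bound on |corr(feature, y)|.

    Illustrative stand-in only (an assumption, not the paper's method):
    missing entries in X are NaN, and y is assumed fully observed.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    scores = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        observed = ~np.isnan(X[:, j])
        n_obs = int(observed.sum())
        if n_obs < 4:
            # The Fisher standard error 1/sqrt(n - 3) needs at least 4 values.
            continue
        xj, yj = X[observed, j], y[observed]
        if xj.std() == 0 or yj.std() == 0:
            # Correlation is undefined for a constant column; leave score at 0.
            continue
        r = np.corrcoef(xj, yj)[0, 1]
        r = np.clip(r, -0.999999, 0.999999)  # keep arctanh finite
        z = np.arctanh(abs(r))               # Fisher z-transform of |r|
        se = 1.0 / np.sqrt(n_obs - 3)        # grows as missingness grows
        scores[j] = np.tanh(max(0.0, z - z_crit * se))
    return scores


def select_features(X, y, k, z_crit=1.96):
    """Return the indices of the k features with the highest scores."""
    scores = lower_bound_relevance(X, y, z_crit)
    return np.argsort(scores)[::-1][:k]
```

Under these assumptions, the selected columns would then be passed to whichever imputation method is in use, so that imputation operates on fewer, less sparsely observed features.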