Two different approaches exist to handle missing values for prediction: either imputation, prior to fitting any predictive algorithms, or dedicated methods able to natively incorporate missing values. While imputation is widely (and easily) used, it is unfortunately biased when low-capacity predictors (such as linear models) are applied afterward. However, in practice, naive imputation exhibits good predictive performance. In this paper, we study the impact of imputation in a high-dimensional linear model with MCAR missing data. We prove that zero imputation performs an implicit regularization closely related to the ridge method, often used in high-dimensional problems. Leveraging this connection, we establish that the imputation bias is controlled by a ridge bias, which vanishes in high dimension. As a predictor, we argue in favor of the averaged SGD strategy, applied to zero-imputed data. We establish an upper bound on its generalization error, highlighting that imputation is benign in the $d \gg \sqrt{n}$ regime. Experiments illustrate our findings.
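To make the setting concrete, the following minimal NumPy sketch (ours, not the paper's code) simulates a high-dimensional linear model with MCAR missing entries, applies naive zero imputation, and fits the imputed data with one pass of Polyak-Ruppert averaged SGD, comparing its test risk to an explicitly ridge-regularized estimator on complete data. All concrete choices here (dimensions, missingness rate `rho`, step size, and the heuristic ridge level `lam`) are illustrative assumptions, not values prescribed by the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 500, 1000          # d >> sqrt(n): the regime where imputation is claimed benign
rho = 0.5                 # MCAR probability that an entry is observed
sigma = 0.1               # noise standard deviation

beta = rng.normal(size=d) / np.sqrt(d)        # ground-truth coefficients
X = rng.normal(size=(n, d))                   # complete (isotropic) design
y = X @ beta + sigma * rng.normal(size=n)     # linear responses

mask = rng.random((n, d)) < rho               # MCAR observation pattern
X_imp = np.where(mask, X, 0.0)                # naive zero imputation

# One-pass averaged SGD on the zero-imputed data, constant step size,
# with Polyak-Ruppert averaging of the iterates.
gamma = 1.0 / (2 * np.mean(np.sum(X_imp**2, axis=1)))  # conservative step size
w = np.zeros(d)
w_avg = np.zeros(d)
for t in range(n):
    x_t, y_t = X_imp[t], y[t]
    w -= gamma * (x_t @ w - y_t) * x_t        # stochastic gradient step
    w_avg += (w - w_avg) / (t + 1)            # running average of iterates

# Explicit ridge on the complete data, to compare against the implicit
# regularization induced by zero imputation (lam is a rough heuristic).
lam = d * (1 - rho) / rho
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Compare prediction risks on fresh data; the zero-imputation predictor
# is evaluated on test inputs that are themselves MCAR-masked and imputed.
X_test = rng.normal(size=(2000, d))
y_test = X_test @ beta
mask_test = rng.random(X_test.shape) < rho
X_test_imp = np.where(mask_test, X_test, 0.0)
print("avg-SGD on zero-imputed data, risk:",
      np.mean((X_test_imp @ w_avg - y_test) ** 2))
print("ridge on complete data, risk:     ",
      np.mean((X_test @ w_ridge - y_test) ** 2))
```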