Imputing missing values is an important preprocessing step in data analysis, but the literature offers little guidance on how to choose between imputation models. This letter suggests adopting the imputation model that generates a density of imputed values most similar to that of the observed values for an incomplete variable after balancing all other covariates. We recommend stable balancing weights as a practical approach to balancing the covariates, whose distributions are expected to differ between cases with observed and missing values when the values are not missing completely at random. After balancing, discrepancy statistics can be used to compare the densities of the imputed and observed values. We illustrate the suggested approach using simulated data and real-world survey data from the American National Election Study, comparing popular imputation approaches including random forests, hot-deck, predictive mean matching, and multivariate normal imputation. An R package implementing the suggested approach accompanies this letter.
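The comparison step described above can be illustrated with a minimal Python sketch. It is not the accompanying R package or its API. It assumes that balancing weights have already been estimated elsewhere and uses a weighted two-sample Kolmogorov-Smirnov statistic as one possible discrepancy statistic. The function name `weighted_ks` and its parameters are hypothetical.

```python
import numpy as np


def weighted_ks(observed, imputed, w_observed=None, w_imputed=None):
    """Weighted two-sample Kolmogorov-Smirnov statistic.

    Returns the largest absolute gap between the weighted empirical CDFs
    of the observed and the imputed values. The weights are assumed to be
    balancing weights computed beforehand (hypothetical inputs).
    """
    observed = np.asarray(observed, dtype=float)
    imputed = np.asarray(imputed, dtype=float)

    # Fall back to uniform weights (no balancing) when none are supplied.
    w_obs = (np.ones_like(observed) if w_observed is None
             else np.asarray(w_observed, dtype=float))
    w_imp = (np.ones_like(imputed) if w_imputed is None
             else np.asarray(w_imputed, dtype=float))

    # Normalise each set of weights to sum to one.
    w_obs = w_obs / w_obs.sum()
    w_imp = w_imp / w_imp.sum()

    # Evaluate both CDFs at every distinct value in the pooled sample.
    grid = np.union1d(observed, imputed)

    def ecdf(values, weights):
        # Weighted share of `values` that is <= each grid point.
        order = np.argsort(values)
        cumulative = np.concatenate(([0.0], np.cumsum(weights[order])))
        idx = np.searchsorted(values[order], grid, side="right")
        return cumulative[idx]

    gap = np.abs(ecdf(observed, w_obs) - ecdf(imputed, w_imp))
    return float(gap.max())
```

Under these assumptions, one would compute the statistic once per candidate imputation model, passing the same balancing weights each time, and prefer the model with the smallest value.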