Multiple imputation is increasingly used to handle missing data. Although several conventional multiple imputation approaches are well studied and have shown empirical validity, they have limitations when processing large datasets with complex data structures. Their imputation performance usually relies on correct specification of the imputation models, which requires expert knowledge of the inherent relations among variables. In addition, these standard approaches tend to be computationally inefficient for medium and large datasets. In this paper, we propose mixgb, a scalable multiple imputation framework based on XGBoost, bootstrapping, and predictive mean matching. XGBoost, one of the fastest implementations of gradient boosted trees, can automatically preserve interactions and non-linear relations in a dataset while achieving high computational efficiency. With the aid of bootstrapping and predictive mean matching, we show that our approach obtains less biased estimates and better reflects appropriate imputation variability. The proposed framework is implemented in the R package mixgb. Supplementary materials for this article are available online.
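To make the combination of bootstrapping and predictive mean matching concrete, the following is a minimal sketch in Python. It is not the mixgb implementation: an ordinary least-squares fit stands in for the XGBoost learner so that the example depends only on NumPy, and the function name `pmm_impute`, its parameters `m` and `k`, and the single-variable setting are assumptions made for illustration.

```python
import numpy as np


def pmm_impute(X, y, m=5, k=5, seed=0):
    """Return m imputed copies of y, where NaN marks a missing entry.

    Illustrative sketch only: a least-squares model stands in for XGBoost.
    """
    rng = np.random.default_rng(seed)
    X = np.column_stack([np.ones(len(X)), X])  # add an intercept column
    obs = ~np.isnan(y)
    X_obs, y_obs, X_mis = X[obs], y[obs], X[~obs]
    k = min(k, len(y_obs))  # cannot have more donors than observed rows

    imputations = []
    for _ in range(m):
        # Bootstrap the observed rows so that each imputed dataset
        # reflects uncertainty in the fitted model.
        idx = rng.integers(0, len(y_obs), size=len(y_obs))
        beta, *_ = np.linalg.lstsq(X_obs[idx], y_obs[idx], rcond=None)

        pred_obs = X_obs @ beta
        pred_mis = X_mis @ beta

        # Predictive mean matching: for each missing row, take the k observed
        # rows whose predictions are closest and draw one of their observed
        # values at random.
        dist = np.abs(pred_mis[:, None] - pred_obs[None, :])
        donors = np.argsort(dist, axis=1)[:, :k]
        choice = rng.integers(0, k, size=len(pred_mis))
        picked = donors[np.arange(len(pred_mis)), choice]

        y_imp = y.copy()
        y_imp[~obs] = y_obs[picked]
        imputations.append(y_imp)
    return imputations
```

Because every imputed value is drawn from the observed values of a donor, the imputations stay within the observed range of the variable, and the bootstrap step makes the m completed datasets differ from one another. In a setting closer to the one described here, the least-squares fit would be replaced by a gradient boosted tree model fitted to each bootstrap sample.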