Missing data remains a common problem in large datasets, including survey and census data containing many ordinal responses, such as political polls and opinion surveys. Multiple imputation (MI) is often the approach of choice for analyzing such incomplete datasets, and several implementations of MI exist, including methods based on generalized linear models, tree-based models, and Bayesian non-parametric models. However, there is limited research on the statistical performance of these methods for multivariate ordinal data. In this article, we perform an empirical evaluation of several MI methods: MI by chained equations (MICE) using multinomial logistic regression models, MICE using proportional odds logistic regression models, MICE using classification and regression trees, MICE using random forests, MI using Dirichlet process (DP) mixtures of products of multinomial distributions, and MI using DP mixtures of multivariate normal distributions. We evaluate the methods using simulation studies based on ordinal variables selected from the 2018 American Community Survey (ACS). Under our simulation settings, the results suggest that MI using proportional odds logistic regression models, classification and regression trees, and DP mixtures of products of multinomial distributions generally outperforms the other methods. In certain settings, MI using multinomial logistic regression models achieves comparable performance, depending on the missing data mechanism and the amount of missing data.
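As a rough illustration of the chained-equations idea referred to above, the following Python sketch cycles through ordinal variables and redraws each missing cell conditional on the current values of another variable. It is a toy under stated assumptions, not any of the methods evaluated in the article: the function name `mice_ordinal`, the use of a single neighbouring variable as the only predictor, and the donor-based conditional draw (which stands in for a fitted model such as proportional odds regression or a tree) are all hypothetical choices made for brevity.

```python
import random


def mice_ordinal(rows, levels, n_iter=5, seed=0):
    """Return one completed copy of `rows` from a toy chained-equations pass.

    rows   : list of equal-length lists of ordinal levels, None marks a missing cell
    levels : allowed ordinal levels, e.g. [1, 2, 3]
    """
    rng = random.Random(seed)
    n_vars = len(rows[0])
    missing = [[v is None for v in row] for row in rows]
    data = [row[:] for row in rows]  # work on a copy, leave the input untouched

    # Step 1: fill each missing cell with a draw from that variable's observed values.
    for j in range(n_vars):
        observed = [row[j] for row in rows if row[j] is not None]
        for i, row in enumerate(data):
            if missing[i][j]:
                row[j] = rng.choice(observed if observed else levels)

    # Step 2: cycle through the variables, redrawing each missing cell
    # conditional on the current (observed or imputed) values of a predictor.
    for _ in range(n_iter):
        for j in range(n_vars):
            k = (j + 1) % n_vars  # assumption: one neighbouring variable as predictor
            for i, row in enumerate(data):
                if not missing[i][j]:
                    continue
                # Donors: observed values of variable j in rows that agree on variable k.
                donors = [r[j] for d, r in enumerate(data)
                          if not missing[d][j] and r[k] == row[k]]
                if not donors:  # fall back to the observed marginal, then to all levels
                    donors = [r[j] for d, r in enumerate(data) if not missing[d][j]]
                row[j] = rng.choice(donors if donors else levels)
    return data


# Usage: m = 5 completed datasets from a small made-up table with three ordinal items.
toy = [[1, 2, None], [2, None, 3], [None, 1, 1], [3, 3, 3], [1, 1, 2], [2, 2, None]]
completed = [mice_ordinal(toy, levels=[1, 2, 3], seed=s) for s in range(5)]
```

Each element of `completed` is one imputed dataset; in an actual MI analysis, the quantity of interest would be estimated on each completed dataset and the estimates then combined. A real MICE implementation would condition on all other variables and draw imputations from a fitted model, so this sketch only conveys the control flow.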