Missing data remains a common problem in large datasets, including survey and census data containing many ordinal responses, such as political polls and opinion surveys. Multiple imputation (MI) is usually the go-to approach for analyzing such incomplete datasets, and several implementations of MI exist, including methods based on generalized linear models, tree-based models, and Bayesian non-parametric models. However, there is limited research on the statistical performance of these methods for multivariate ordinal data. In this article, we empirically evaluate several MI methods: MI by chained equations (MICE) using multinomial logistic regression models, MICE using proportional odds logistic regression models, MICE using classification and regression trees, MICE using random forests, MI using Dirichlet process (DP) mixtures of products of multinomial distributions, and MI using DP mixtures of multivariate normal distributions. We evaluate the methods using simulation studies based on ordinal variables selected from the 2018 American Community Survey (ACS). Under our simulation settings, the results suggest that MI using proportional odds logistic regression models, classification and regression trees, and DP mixtures of products of multinomial distributions generally outperform the other methods. In certain settings, MI using multinomial logistic regression models and DP mixtures of multivariate normal distributions can achieve comparable performance, depending on the missing data mechanism and the amount of missing data.