Missing data is a common problem in medical research and is commonly addressed using multiple imputation. Although traditional imputation methods allow for valid statistical inference when data are missing at random (MAR), their implementation is problematic when the presence of missingness depends on unobserved variables, i.e. when the data are missing not at random (MNAR). Unfortunately, this MNAR situation is rather common in observational studies, registries and other sources of real-world data. While several imputation methods have been proposed for handling MNAR data in individual studies, their application and validity in large datasets with a multilevel structure remain unclear. We therefore explored in depth the consequences of MNAR data in hierarchical datasets and proposed a novel multilevel imputation method for common missing-data patterns in clustered datasets. This method is based on the principles of Heckman selection models and adopts a two-stage meta-analysis approach to impute binary and continuous variables that may be outcomes or predictors and that are systematically or sporadically missing. After evaluating the proposed imputation model in simulated scenarios, we illustrate its use in a cross-sectional community survey to estimate the prevalence of malaria parasitemia in children aged 2-10 years in five subregions of Uganda.