To Impute or not to Impute? Missing Data in Treatment Effect Estimation

Missing data is a systemic problem in practical scenarios that causes noise and bias when estimating treatment effects, which makes treatment effect estimation from data with missingness particularly tricky. A key reason is that standard assumptions on missingness are rendered insufficient by the presence of an additional variable, the treatment, besides the individual and the outcome. A treatment variable introduces additional complexity regarding why some variables are missing, which previous work has not fully explored. In our work, we identify a new missingness mechanism, which we term mixed confounded missingness (MCM), where some missingness determines treatment selection and other missingness is determined by treatment selection. Given MCM, we show that naively imputing all data leads to poorly performing treatment effect models, as the act of imputation effectively removes information necessary to provide unbiased estimates. However, no imputation at all also leads to biased estimates, as missingness determined by treatment divides the population into distinct subpopulations, and estimates across these subpopulations will be biased. Our solution is selective imputation, where we use insights from MCM to inform precisely which variables should be imputed and which should not. We empirically demonstrate how various learners benefit from selective imputation compared to other solutions for missing data.
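To make the idea of selective imputation concrete, the following is a minimal sketch and not the authors' implementation. It assumes we already know, from domain knowledge, which columns have missingness determined by treatment (these are imputed) and which have missingness that determines treatment (these are left un-imputed, with the missingness exposed to the learner). The helper name `selective_impute`, the use of mean imputation, and the indicator-plus-placeholder encoding are all illustrative assumptions.

```python
import numpy as np

def selective_impute(X, impute_cols, fill_value=0.0):
    """Hypothetical helper: mean-impute only `impute_cols`; for every other
    column, keep the missingness visible through an appended indicator."""
    X = np.asarray(X, dtype=float)
    out = X.copy()
    impute_set = set(impute_cols)
    keep_cols = [j for j in range(X.shape[1]) if j not in impute_set]

    # Assumed group 1: missingness is determined by treatment -> impute.
    # Mean imputation is a stand-in for any imputation method.
    for j in impute_cols:
        observed = ~np.isnan(out[:, j])
        out[~observed, j] = out[observed, j].mean()

    # Assumed group 2: missingness determines treatment -> do not impute.
    # The indicator columns preserve the missingness pattern; the placeholder
    # only makes the array usable by learners that cannot handle NaN.
    indicators = np.isnan(out[:, keep_cols]).astype(float)
    out[:, keep_cols] = np.where(indicators == 1.0, fill_value, out[:, keep_cols])
    return np.hstack([out, indicators])

# Toy covariates: column 2 is assumed to be missing because of treatment.
X = np.array([
    [1.0, np.nan, 3.0],
    [np.nan, 2.0, 5.0],
    [4.0, 6.0, np.nan],
])
Z = selective_impute(X, impute_cols=[2])
print(Z)
```

The output `Z` has the three original columns followed by one indicator per un-imputed column, and could be passed to any treatment effect learner. Deciding which variables belong to which group is the substantive step; the code only applies that decision.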