To Impute or not to Impute? Missing Data in Treatment Effect Estimation

Missing data is a systemic problem in practical scenarios that causes noise and bias when estimating treatment effects. This makes treatment effect estimation from data with missingness a particularly tricky endeavour. A key reason is that standard assumptions on missingness are rendered insufficient by the presence of an additional variable, treatment, besides the input (e.g. an individual) and the label (e.g. an outcome). The treatment variable introduces additional complexity with respect to why some variables are missing, which previous work has not fully explored. In our work, we introduce mixed confounded missingness (MCM), a new missingness mechanism where some missingness determines treatment selection and other missingness is determined by treatment selection. Given MCM, we show that naively imputing all data leads to poorly performing treatment effect models, as the act of imputation effectively removes information necessary to provide unbiased estimates. However, no imputation at all also leads to biased estimates, as missingness determined by treatment introduces bias in covariates. Our solution is selective imputation, where we use insights from MCM to inform precisely which variables should be imputed and which should not. We empirically demonstrate how various learners benefit from selective imputation compared to other solutions for missing data. We highlight that our experiments encompass both average treatment effects and conditional average treatment effects.
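
As a rough illustration of the selective imputation idea, the sketch below imputes only a chosen subset of covariates and leaves the others with their missing entries intact. It is a minimal sketch under stated assumptions, not the paper's implementation: the function name `selective_impute`, the parameter `impute_cols`, and the use of column-mean imputation are all hypothetical choices, and which covariates belong to which group is assumed to be known from domain knowledge.

```python
import numpy as np

def selective_impute(X, impute_cols):
    """Mean-impute only the columns listed in `impute_cols` (hypothetical helper).

    Columns not listed keep their NaN entries, so the fact that a value
    was missing remains visible to a downstream treatment effect learner.
    Returns the partially imputed matrix and a 0/1 mask of the original
    missing entries.
    """
    X = np.asarray(X, dtype=float)
    mask = np.isnan(X).astype(float)  # 1.0 where a value was missing
    X_out = X.copy()
    for j in impute_cols:
        missing = np.isnan(X_out[:, j])
        observed = X_out[~missing, j]
        # Column mean is a stand-in imputer; fall back to 0.0 if nothing is observed.
        fill = observed.mean() if observed.size else 0.0
        X_out[missing, j] = fill
    return X_out, mask

X = np.array([
    [1.0, np.nan, 3.0],
    [2.0, 5.0, np.nan],
    [np.nan, 7.0, 9.0],
])

# Assumption for illustration: missingness in column 2 is determined by
# treatment, so it is imputed; missingness in columns 0 and 1 is assumed to
# determine treatment selection, so those entries are left as NaN.
X_sel, M = selective_impute(X, impute_cols=[2])
```

The returned mask `M` can be appended to the covariates for learners that cannot handle NaN values directly; whether that is appropriate depends on the learner and is not specified by the abstract.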
