Even the most carefully curated economic data sets have variables that are noisy, missing, discretized, or privatized. The standard workflow for empirical research involves data cleaning followed by data analysis that typically ignores the bias and variance consequences of data cleaning. We formulate a semiparametric model for causal inference with corrupted data to encompass both data cleaning and data analysis. We propose a new end-to-end procedure for data cleaning, estimation, and inference with data cleaning-adjusted confidence intervals. We prove consistency, Gaussian approximation, and semiparametric efficiency for our estimator of the causal parameter by finite sample arguments. The rate of Gaussian approximation is $n^{-1/2}$ for global parameters such as average treatment effect, and it degrades gracefully for local parameters such as heterogeneous treatment effect for a specific demographic. Our key assumption is that the true covariates are approximately low rank. In our analysis, we provide nonasymptotic theoretical contributions to matrix completion, statistical learning, and semiparametric statistics. We verify the coverage of the data cleaning-adjusted confidence intervals in simulations calibrated to resemble differential privacy as implemented in the 2020 US Census.
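To make the low-rank assumption concrete, the following is a minimal sketch of how approximately low-rank structure can be used to clean noisy covariates with missing entries. It uses a truncated SVD as the cleaning step, which is an illustrative assumption rather than the authors' stated procedure. The function names, simulation design, and parameter values (`truncated_svd_clean`, `simulate`, the rank, noise level, and missingness rate) are all hypothetical. The sketch covers only the data cleaning stage, not the causal estimator or the data cleaning-adjusted confidence intervals.

```python
import numpy as np


def truncated_svd_clean(Z, rank):
    """Denoise and impute Z by projecting onto its top `rank` singular directions.

    Missing entries (NaN) are filled with zero and the matrix is rescaled by
    the observed fraction, so the filled matrix is unbiased for the signal
    when entries are missing completely at random (an assumption here).
    """
    observed = ~np.isnan(Z)
    p_hat = observed.mean()  # estimated probability that an entry is observed
    Z_filled = np.where(observed, Z, 0.0) / p_hat
    U, s, Vt = np.linalg.svd(Z_filled, full_matrices=False)
    return (U[:, :rank] * s[:rank]) @ Vt[:rank, :]


def simulate(n=500, p=50, rank=3, noise_sd=1.0, miss_prob=0.2, seed=0):
    """Hypothetical design: exactly low-rank covariates, Gaussian noise, random missingness."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, rank)) @ rng.normal(size=(rank, p))  # true covariates
    Z = X + noise_sd * rng.normal(size=(n, p))                   # measurement error
    Z[rng.random((n, p)) < miss_prob] = np.nan                   # missing entries
    return X, Z


X, Z = simulate()
X_hat = truncated_svd_clean(Z, rank=3)

# Baseline for comparison: keep noisy values and fill missing entries with zero.
naive = np.where(np.isnan(Z), 0.0, Z)
err_naive = np.mean((naive - X) ** 2)
err_clean = np.mean((X_hat - X) ** 2)
print(f"mean squared error, zero-filled: {err_naive:.3f}")
print(f"mean squared error, low-rank cleaned: {err_clean:.3f}")
```

In this simulated design the low-rank projection averages out noise across many correlated columns, so the cleaned covariates should be closer to the truth than the zero-filled ones. An analysis that then treats `X_hat` as if it were the true covariates would ignore the error left over from cleaning, which is the gap the abstract's end-to-end procedure is meant to address.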