Even the most carefully curated economic data sets have variables that are noisy, missing, discretized, or privatized. The standard workflow for empirical research involves data cleaning followed by data analysis, and the analysis typically ignores the bias and variance consequences of the cleaning step. We formulate a semiparametric model for causal inference with corrupted data that encompasses both data cleaning and data analysis. We propose a new end-to-end procedure for data cleaning, estimation, and inference with data-cleaning-adjusted confidence intervals. Using finite-sample arguments, we prove root-n consistency, Gaussian approximation, and semiparametric efficiency for our estimator of the causal parameter. Our key assumption is that the true covariates are approximately low rank. In our analysis, we provide nonasymptotic theoretical contributions to matrix completion, statistical learning, and semiparametric statistics. We verify the coverage of the data-cleaning-adjusted confidence intervals in simulations.
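To make the low-rank assumption concrete, the following is a minimal sketch of how approximately low-rank covariates can be recovered from noisy, partially missing observations with a truncated singular value decomposition. It is an illustration under stated assumptions, not necessarily the procedure proposed in the paper: the function names `clean_low_rank` and `simulate`, the simulated design, the choice of rank, and the zero-fill-and-rescale step are all hypothetical choices made for this example.

```python
import numpy as np


def clean_low_rank(Z, rank):
    """Denoise and impute Z (NaN = missing) with a rank-`rank` truncated SVD.

    Assumption for this sketch: entries are missing completely at random,
    so rescaling by the observed fraction removes the zero-fill bias.
    """
    observed = ~np.isnan(Z)
    p_hat = observed.mean()                       # fraction of observed entries
    filled = np.where(observed, Z, 0.0) / p_hat   # zero-fill, then rescale
    U, s, Vt = np.linalg.svd(filled, full_matrices=False)
    # Keep only the leading `rank` singular components.
    return (U[:, :rank] * s[:rank]) @ Vt[:rank, :]


def simulate(n=500, p=50, rank=3, noise_sd=0.5, obs_prob=0.8, seed=0):
    """Simulate exactly low-rank covariates X and a corrupted version Z."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, rank)) @ rng.normal(size=(rank, p))  # true covariates
    Z = X + noise_sd * rng.normal(size=(n, p))                   # measurement noise
    Z[rng.random((n, p)) > obs_prob] = np.nan                    # missing entries
    return X, Z


X, Z = simulate()
X_hat = clean_low_rank(Z, rank=3)

# Baseline for comparison: replace missing entries by zero, no denoising.
naive = np.where(np.isnan(Z), 0.0, Z)
err_clean = np.mean((X_hat - X) ** 2)
err_naive = np.mean((naive - X) ** 2)
print(f"MSE after low-rank cleaning: {err_clean:.3f}")
print(f"MSE with naive zero-fill:    {err_naive:.3f}")
```

In this simulated design the cleaned covariates should be considerably closer to the truth than the zero-filled ones. The sketch covers the cleaning step only. It does not propagate the resulting error into downstream estimation or confidence intervals, which is the problem the abstract addresses.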