The 2020 US Census will be published with differential privacy, implemented by injecting synthetic noise into the data. Controversy has ensued, with debates centering on the painful trade-off between the privacy of respondents and the precision of economic analysis. Is this trade-off inevitable? To answer this question, we formulate a semiparametric model of causal inference with high-dimensional data that may be noisy, missing, discretized, or privatized. We propose a new end-to-end procedure for data cleaning, estimation, and inference, with data-cleaning-adjusted confidence intervals. We prove consistency, Gaussian approximation, and semiparametric efficiency via finite-sample arguments. The rate of Gaussian approximation is $n^{-1/2}$ for semiparametric estimands such as the average treatment effect, and it degrades gracefully for nonparametric estimands such as heterogeneous treatment effects. Our key assumption is that the true covariates are approximately low rank, which we interpret as approximate repeated measurements and validate in the Census. In our analysis, we provide nonasymptotic theoretical contributions to matrix completion, statistical learning, and semiparametric statistics. We verify the coverage of the data-cleaning-adjusted confidence intervals in simulations. Finally, we conduct a semi-synthetic exercise calibrated to the privacy levels mandated for the 2020 US Census.
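The pipeline described above rests on two ingredients: privatization by synthetic noise injection, and data cleaning that exploits the approximate low-rank structure of the true covariates. The following is a minimal numerical sketch of that interaction, not the paper's actual algorithm: it privatizes a low-rank covariate matrix with Gaussian noise (a stand-in for the Census mechanism) and cleans it with a truncated SVD (a standard matrix-completion-style denoiser). The dimensions, noise level, and assumed-known rank are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: n units, d covariates, true rank r (all hypothetical)
n, d, r = 200, 50, 3
U = rng.normal(size=(n, r))
V = rng.normal(size=(d, r))
X_true = U @ V.T  # approximately low-rank "clean" covariates

# Privatization sketch: inject independent Gaussian noise into the data
sigma = 1.0
X_obs = X_true + sigma * rng.normal(size=(n, d))

# Data cleaning sketch: keep the top-r singular directions (truncated SVD)
Uo, s, Vt = np.linalg.svd(X_obs, full_matrices=False)
X_clean = Uo[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]

# Relative Frobenius errors before and after cleaning
err_raw = np.linalg.norm(X_obs - X_true) / np.linalg.norm(X_true)
err_clean = np.linalg.norm(X_clean - X_true) / np.linalg.norm(X_true)
print(err_clean < err_raw)
```

Because the noise spreads its energy roughly evenly across all singular directions while the signal concentrates in the top few, truncating the SVD discards most of the injected noise; this is the intuition behind using low-rank structure to reconcile privacy with downstream precision.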