Federal administrative tax data are invaluable for research, but because of privacy concerns, access to these data is typically limited to select agencies and a few individuals. An alternative to sharing microlevel data is a validation server, which allows individuals to query statistics without accessing the confidential data. This paper studies the feasibility of using differentially private (DP) methods to implement such a server. We provide an extensive study of existing DP methods for releasing tabular statistics, means, quantiles, and regression estimates. We also introduce new methodological adaptations of existing DP regression algorithms so that they handle new data types and return standard error estimates. We evaluate the selected methods on the accuracy of their output for statistical analyses, using real administrative tax data obtained from the Internal Revenue Service Statistics of Income (SOI) Division. Our findings show that a validation server would be feasible for simple statistics but would struggle to produce accurate regression estimates and confidence intervals. We outline challenges and offer recommendations for future work on validation servers. This is the first comprehensive statistical study of DP methodology on a real, complex dataset, and it has significant implications for the direction of a growing research field.