Federal administrative tax data are invaluable for research, but because of privacy concerns, access to these data is typically limited to select agencies and a few individuals. An alternative to sharing microlevel data is a validation server, which allows individuals to query statistics without directly accessing the confidential data. This paper studies the feasibility of using differentially private (DP) methods to implement such a server. We provide an extensive study of existing DP methods for releasing tabular statistics, means, quantiles, and regression estimates. We also include new methodological adaptations of existing DP regression methods so that they can handle new data types and return standard error estimates. We evaluate the selected methods based on the accuracy of their output for statistical analyses, using real administrative tax data obtained from the Internal Revenue Service. Our findings show that a validation server is feasible for simple, univariate statistics but struggles to produce accurate regression estimates and confidence intervals. We outline challenges and offer recommendations for future work on validation server frameworks. This is the first comprehensive statistical study of DP regression methodology on a real, complex dataset, and it has significant implications for the direction of a growing research field and for public policy.