The protection of private information is of vital importance in data-driven research, business, and government. The conflict between privacy and utility has triggered intensive research in the computer science and statistics communities, which have developed a variety of methods for privacy-preserving data release. Among the main concepts that have emerged are anonymity and differential privacy. Today, another solution is gaining traction: synthetic data. However, the road to privacy is paved with NP-hard problems. In this paper we focus on the NP-hard challenge of developing a synthetic data generation method that is computationally efficient, comes with provable privacy guarantees, and rigorously quantifies data utility. We solve a relaxed version of this problem by studying a fundamental, but at first glance completely unrelated, problem in probability concerning the concept of covariance loss. Namely, we find a nearly optimal and constructive answer to the question of how much information is lost when we take conditional expectation. Surprisingly, this excursion into theoretical probability produces mathematical techniques that allow us to derive constructive, approximately optimal solutions to difficult applied problems concerning microaggregation, privacy, and synthetic data.
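The covariance loss mentioned above can be made concrete with a small numerical sketch. The following toy example (not the paper's algorithm; the partitioning scheme here is an arbitrary illustrative choice) replaces each data point by the mean of its group, i.e. takes a conditional expectation with respect to a partition, and measures how much of the covariance structure is lost. By the law of total covariance, the difference of the two covariance matrices is positive semidefinite, and its trace quantifies the loss:

```python
# Toy illustration of covariance loss under conditional expectation.
# The k-group split by the first coordinate is a naive stand-in for
# microaggregation, chosen only for illustration.
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 120, 3, 6                      # points, dimension, groups
X = rng.normal(size=(n, d))

# Naive partition: sort by the first coordinate, split into k equal groups.
order = np.argsort(X[:, 0])
groups = np.array_split(order, k)

# Conditional expectation E[X | partition]: each point -> its group mean.
Y = np.empty_like(X)
for g in groups:
    Y[g] = X[g].mean(axis=0)

cov_X = np.cov(X, rowvar=False)
cov_Y = np.cov(Y, rowvar=False)

# Law of total covariance: cov_X - cov_Y is positive semidefinite;
# its trace quantifies the covariance loss of this aggregation.
loss = cov_X - cov_Y
print("covariance loss (trace):", np.trace(loss))
print("smallest eigenvalue of the loss matrix:", np.linalg.eigvalsh(loss).min())
```

Note that the group means preserve the overall mean exactly, so all of the information lost is second-order: the trade-off studied in the paper is between making the groups large (better privacy) and keeping this covariance loss small (better utility).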