The protection of private information is of vital importance in data-driven research, business, and government. The conflict between privacy and utility has triggered intensive research in the computer science and statistics communities, which have developed a variety of methods for privacy-preserving data release. Among the main concepts that have emerged are $k$-anonymity (often implemented via microaggregation) and differential privacy. Today, another solution is gaining traction: synthetic data. However, the road to privacy is paved with NP-hard problems. In this paper we focus on the NP-hard challenge of developing a synthetic data generation method that is computationally efficient, comes with provable privacy guarantees, and rigorously quantifies data utility. We solve a relaxed version of this problem by studying a fundamental, but at first glance completely unrelated, problem in probability concerning the concept of covariance loss. Namely, we find a nearly optimal and constructive answer to the question of how much information is lost when we take conditional expectation. Surprisingly, this excursion into theoretical probability produces mathematical techniques that allow us to derive constructive, approximately optimal solutions to difficult applied problems concerning microaggregation, privacy, and synthetic data.