Common datasets have the form of elements with keys (e.g., transactions and products), and the goal is to perform analytics on the aggregated form of key and frequency pairs. A weighted sample of keys by (a function of) frequency is a highly versatile summary that provides a sparse set of representative keys and supports approximate evaluations of query statistics. We propose private weighted sampling (PWS): a method that ensures element-level differential privacy while retaining, to the extent possible, the utility of the corresponding non-private weighted sample. PWS maximizes the reporting probabilities of keys and the estimation quality of a broad family of statistics. PWS also improves over the state of the art for the well-studied special case of private histograms, where no sampling is performed. We empirically demonstrate significant performance gains compared with prior baselines: a 20%-300% increase in key reporting for common Zipfian frequency distributions, and accuracy for $2\times$-$8\times$ lower frequencies in estimation tasks. Moreover, PWS is applied as a simple post-processing of a non-private sample, without requiring the original data. This allows for seamless integration with existing implementations of non-private schemes and retains the efficiency of schemes designed for resource-constrained settings such as massive distributed or streamed data. We believe that, due to its practicality and performance, PWS may become a method of choice in applications where privacy is desired.
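To make the setting concrete, the following is a minimal sketch of the non-private starting point only: aggregating keyed elements into key and frequency pairs, drawing a weighted sample of keys, and estimating a sum statistic from the sample. It is not the PWS method itself. The sampling rule (independent inclusion with probability min(1, f/tau)), the inverse-probability estimator, and all names (`aggregate`, `pps_sample`, `estimate_sum`, `tau`) are assumptions chosen for illustration.

```python
import random
from collections import Counter


def aggregate(elements):
    """Aggregate a sequence of keyed elements into key -> frequency."""
    return Counter(elements)


def pps_sample(freqs, tau, rng):
    """Non-private weighted sample (illustrative assumption, not PWS).

    Each key is included independently with probability min(1, f / tau),
    where f is its frequency. The inclusion probability is stored with the
    key so that estimates can be computed later.
    """
    sample = {}
    for key, f in freqs.items():
        p = min(1.0, f / tau)
        if rng.random() < p:
            sample[key] = (f, p)
    return sample


def estimate_sum(sample, g=lambda f: f):
    """Inverse-probability estimate of the sum of g(frequency) over all keys."""
    return sum(g(f) / p for f, p in sample.values())


if __name__ == "__main__":
    rng = random.Random(0)
    elements = ["a", "a", "b", "c", "a"]
    freqs = aggregate(elements)
    sample = pps_sample(freqs, tau=2.0, rng=rng)
    print("sampled keys:", sorted(sample))
    print("estimated total frequency:", estimate_sum(sample))
```

A larger `tau` yields a sparser sample. A private scheme in the spirit of the abstract would post-process such a sample, and that step is deliberately not shown here because its details are not given in this text.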