We design, to the best of our knowledge, the first differentially private (DP) stream processing system at scale. Our system --Differential Privacy SQL Pipelines (DP-SQLP)-- is built using a streaming framework similar to Spark streaming, and is built on top of the Spanner database and the F1 query engine from Google. Towards designing DP-SQLP we make both algorithmic and systemic advances, namely, we (i) design a novel DP key selection algorithm that can operate on an unbounded set of possible keys, and can scale to one billion keys that users have contributed, (ii) design a preemptive execution scheme for DP key selection that avoids enumerating all the keys at each triggering time, and (iii) use algorithmic techniques from DP continual observation to release a continual DP histogram of user contributions to different keys over the stream length. We empirically demonstrate the efficacy by obtaining at least $16\times$ reduction in error over meaningful baselines we consider.
翻译:我们设计了目前为止第一个具有规模的差分隐私(DP)流处理系统。我们的系统——差分隐私SQL管道(DP-SQLP)——采用与Spark流式处理类似的流式框架构建,并建立在谷歌的Spanner数据库和F1查询引擎之上。为了设计DP-SQLP,我们进行了算法和系统方面的创新,即,我们(i)设计了一种新颖的DP关键字选择算法,可以操作一个无界的可能关键字集合,并且可以扩展到用户已经贡献了十亿个关键字,(ii)设计了一种DP关键字选择的预先执行方案,避免在每个触发时间枚举所有关键字,(iii)使用DP连续观察的算法技术,在流长度上发布用户贡献到不同关键字的连续DP直方图。我们通过至少考虑有意义的基线获得了至少$16\times$的误差降低。