Data-driven systems gather increasing amounts of data from users, and sensitive user data requires privacy protections. In some cases, the data gathered is non-numerical or symbolic, and conventional approaches to privacy, such as adding noise, do not apply. Accordingly, we present a novel differential privacy framework for protecting trajectories generated by symbolic systems. These trajectories can be represented as words or strings over a finite alphabet. We develop new differential privacy mechanisms that approximate a sensitive word by a random word that is likely to be near it. An offline mechanism is implemented efficiently using a Modified Hamming Distance Automaton to generate whole privatized output words over a finite time horizon. An online mechanism is then implemented by taking in a sensitive symbol and generating a randomized output symbol at each time step. This work is extended to Markov chains to generate differentially private state sequences that a given Markov chain could have produced. Statistical accuracy bounds quantify the fidelity of these mechanisms, and numerical results validate these techniques for strings of English words.
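The abstract does not spell out the mechanisms themselves, but a minimal sketch may help fix ideas. The code below implements one standard way to privatize a word: the exponential mechanism with negative Hamming-distance utility, whose per-position structure lets it factorize into independent per-symbol draws (which also illustrates the online, symbol-at-a-time setting). The function name `exponential_mechanism_word`, the `adjacency_bound` parameter (assuming adjacency means Hamming distance at most that bound), and the factorized implementation are illustrative assumptions; this is not the paper's automaton-based offline construction.

```python
import math
import random

def exponential_mechanism_word(sensitive_word, alphabet, epsilon, adjacency_bound=1):
    """Privatize a word over a finite alphabet (illustrative sketch).

    Samples an output word w with probability proportional to
    exp(-epsilon * d_H(w, v) / (2 * adjacency_bound)), where v is the
    sensitive word and d_H is Hamming distance. Because Hamming distance
    decomposes across positions, the sampling factorizes into independent
    per-symbol draws, so no enumeration of candidate words is needed and
    each symbol can be released online as it arrives.
    """
    c = epsilon / (2.0 * adjacency_bound)
    keep_weight = math.exp(c)   # weight of reproducing the true symbol
    other_weight = 1.0          # weight of each alternative symbol
    total = keep_weight + (len(alphabet) - 1) * other_weight

    output = []
    for symbol in sensitive_word:
        if random.random() < keep_weight / total:
            output.append(symbol)  # keep the sensitive symbol
        else:
            # replace with a uniformly chosen different symbol
            others = [s for s in alphabet if s != symbol]
            output.append(random.choice(others))
    return output

# Example usage (hypothetical alphabet and parameters):
alphabet = ["a", "b", "c", "d"]
private_word = exponential_mechanism_word(list("abca"), alphabet, epsilon=2.0)
```

Because the per-symbol probabilities multiply to a quantity proportional to exp(-c · d_H(w, v)), and the normalization constant is the same for every input word, the output distributions of adjacent inputs differ by at most a factor of exp(c · adjacency_bound), giving the differential privacy guarantee under the assumed adjacency relation.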