Privacy concerns have attracted increasing attention in data-driven products and services. Existing legislation forbids arbitrary processing of personal data collected from individuals. Generating synthetic versions of such data with a formal privacy guarantee, such as differential privacy (DP), is considered one way to address these concerns. In this direction, we present a simple, practical, and effective recipe for the text domain: fine-tuning a generative language model with DP allows us to generate useful synthetic text while mitigating privacy concerns. Through extensive empirical analyses, we demonstrate that our method produces synthetic data that is competitive in utility with its non-private counterpart while providing strong protection against potential privacy leakage.
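The abstract does not say which training algorithm is used for the DP fine-tuning. A common choice is DP-SGD, which clips each example's gradient to a fixed norm and adds Gaussian noise to the summed gradients before the update. The sketch below illustrates one such step on plain Python lists, under that assumption. The names `clip_gradient`, `dp_sgd_step`, `max_norm`, and `noise_multiplier` are hypothetical and chosen for illustration; privacy accounting (computing the resulting epsilon) is not shown.

```python
import math
import random


def clip_gradient(grad, max_norm):
    """Scale grad so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def dp_sgd_step(weights, per_example_grads, lr, max_norm, noise_multiplier, rng):
    """One DP-SGD update (assumed algorithm, not confirmed by the source).

    Clips each per-example gradient, sums the clipped gradients, adds
    Gaussian noise to the sum, averages over the batch, and takes a
    gradient step.
    """
    dim = len(weights)
    total = [0.0] * dim
    for grad in per_example_grads:
        clipped = clip_gradient(grad, max_norm)
        for i in range(dim):
            total[i] += clipped[i]

    # The noise standard deviation is noise_multiplier * max_norm,
    # applied independently to each coordinate of the summed gradient.
    sigma = noise_multiplier * max_norm
    batch_size = len(per_example_grads)
    noisy_mean = [
        (total[i] + rng.gauss(0.0, sigma)) / batch_size for i in range(dim)
    ]
    return [w - lr * g for w, g in zip(weights, noisy_mean)]


if __name__ == "__main__":
    rng = random.Random(0)
    grads = [[3.0, 4.0], [0.3, 0.4]]  # toy per-example gradients
    new_weights = dp_sgd_step(
        [0.0, 0.0], grads, lr=0.1, max_norm=1.0, noise_multiplier=1.0, rng=rng
    )
    print(new_weights)
```

In practice, fine-tuning a language model this way would rely on a DP training library that computes per-example gradients efficiently and tracks the privacy budget; the toy version here only shows the clip-then-noise structure of the update.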