To protect the privacy of individuals whose data is being shared, it is important to develop methods that allow researchers and companies to release textual data while providing formal privacy guarantees to its originators. In NLP, substantial effort has gone into building mechanisms that follow the framework of local differential privacy, anonymizing individual text samples before release. In practice, these approaches often yield unsatisfying output language quality because of the strong noise that local differential privacy requires. In this paper, we instead approach the problem through global differential privacy: we train a generative language model in a differentially private manner and then sample data from it. Using natural language prompts and a new prompt-mismatch loss, we create highly accurate and fluent textual datasets that take on specific desired attributes, such as sentiment or topic, and match the statistical properties of the training data. Thorough experiments indicate that our synthetic datasets do not leak information from the original data, are of high language quality, and are well suited for training models for further analysis of real-world data. Notably, we also demonstrate that training classifiers on the private synthetic data outperforms training classifiers directly on the real data with DP-SGD.
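The abstract does not include implementation details, but the differentially private training it refers to is commonly realized with DP-SGD: clip each per-example gradient to a fixed L2 norm, sum, and add Gaussian noise before the parameter update. The following is a minimal sketch of one such step on logistic regression; the function name and all hyperparameters are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def dp_sgd_step(w, X, y, lr=0.1, clip=1.0, noise_mult=1.0, rng=None):
    """One illustrative DP-SGD step for logistic regression (not the
    paper's implementation): per-example gradients are clipped to L2
    norm `clip`, summed, and Gaussian noise of scale noise_mult*clip
    is added before averaging into the update."""
    rng = rng if rng is not None else np.random.default_rng(0)
    grads = []
    for xi, yi in zip(X, y):
        p = 1.0 / (1.0 + np.exp(-xi @ w))      # sigmoid prediction
        g = (p - yi) * xi                      # per-example gradient
        norm = np.linalg.norm(g)
        g = g / max(1.0, norm / clip)          # clip: ||g|| <= clip
        grads.append(g)
    noisy_sum = np.sum(grads, axis=0) + rng.normal(
        0.0, noise_mult * clip, size=w.shape)  # Gaussian mechanism
    return w - lr * noisy_sum / len(X)
```

The privacy guarantee of the full training run would then be tracked across steps with a moments/RD accountant; in practice one would use a library such as Opacus rather than this hand-rolled loop.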