User language data can contain highly sensitive personal content. As such, it is imperative to offer users a strong and interpretable privacy guarantee when learning from their data. In this work, we propose SentDP: pure local differential privacy at the sentence level for a single user document. We propose a novel technique, DeepCandidate, that combines concepts from robust statistics and language modeling to produce high-dimensional, general-purpose $\epsilon$-SentDP document embeddings. This guarantees that any single sentence in a document can be substituted with any other sentence while keeping the embedding $\epsilon$-indistinguishable. Our experiments indicate that these private document embeddings are useful for downstream tasks like sentiment analysis and topic classification and even outperform baseline methods with weaker guarantees like word-level Metric DP.
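Stated formally, this guarantee can be read as standard pure differential privacy with adjacency defined at the sentence level. The following is a sketch in notation assumed here rather than taken from the abstract: $\mathcal{M}$ denotes the randomized embedding mechanism, $x$ and $x'$ are any two documents with the same number of sentences that differ in at most one sentence, and $S$ is any measurable set of output embeddings.

$$\Pr[\mathcal{M}(x) \in S] \le e^{\epsilon} \, \Pr[\mathcal{M}(x') \in S]$$

Under this reading, a smaller $\epsilon$ makes the output distributions for $x$ and $x'$ harder to distinguish, so an observer of the embedding learns little about which sentence occupied the substituted position.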