Most privacy-protection studies on textual data focus on removing explicit sensitive identifiers. However, personal writing style, a strong indicator of authorship, is often neglected. Recent work such as SynTF has shown promising results on privacy-preserving text mining, but its anonymization algorithm can only output numeric term vectors that are difficult for recipients to interpret. We propose a novel text generation model with a two-set exponential mechanism for authorship anonymization. By augmenting semantic information through a REINFORCE training reward function, the model can generate differentially private text that preserves the semantics and grammatical structure of the original text while removing personal traits of the writing style. It does not assume any conditioned labels or parallel text data for training. We evaluate the proposed model on a real-life peer-review dataset and the Yelp review dataset. The results suggest that our model outperforms the state of the art on semantic preservation, authorship obfuscation, and stylometric transformation.
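To make the privacy primitive concrete: the abstract's "two-set exponential mechanism" builds on the standard exponential mechanism from differential privacy, which selects an output with probability proportional to the exponential of its utility score scaled by the privacy budget. The sketch below is a minimal, generic illustration of that primitive applied to word substitution; the `candidates` vocabulary, `utility` scores, and function name are hypothetical and do not reproduce the paper's actual model.

```python
# Generic exponential mechanism for differentially private word selection.
# All names and scores here are illustrative, not the authors' method.
import math
import random

def exponential_mechanism(candidates, utility, epsilon, sensitivity=1.0):
    """Sample one candidate with probability proportional to
    exp(epsilon * utility / (2 * sensitivity))."""
    weights = [math.exp(epsilon * utility[c] / (2.0 * sensitivity))
               for c in candidates]
    return random.choices(candidates, weights=weights, k=1)[0]

# Toy vocabulary: semantically closer substitutes get higher utility,
# so they are sampled more often, yet every word keeps nonzero probability,
# which is what yields plausible deniability about the original word.
candidates = ["excellent", "great", "good", "bad"]
utility = {"excellent": 0.9, "great": 0.8, "good": 0.7, "bad": 0.1}
word = exponential_mechanism(candidates, utility, epsilon=5.0)
```

A smaller `epsilon` flattens the sampling distribution (stronger privacy, weaker semantic fidelity), while a larger `epsilon` concentrates mass on high-utility substitutes; this is the privacy-utility trade-off the abstract's semantic-preservation results are measured against.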