Texts convey sophisticated knowledge. However, texts also convey sensitive information. Despite the success of general-purpose language models and domain-specific mechanisms with differential privacy (DP), existing text sanitization mechanisms still provide low utility, due to the curse of dimensionality in text representations. The companion issue of utilizing sanitized texts for downstream analytics is also under-explored. This paper takes a direct approach to text sanitization. Our insight is to consider both sensitivity and similarity via our new local DP notion. The sanitized texts also contribute to our sanitization-aware pretraining and fine-tuning, enabling privacy-preserving natural language processing over the BERT language model with promising utility. Surprisingly, the high utility does not boost the success rate of inference attacks.