As privacy and trust receive increasing attention within the research community, various attempts have been made to anonymize textual data. A significant subset of these approaches incorporates differentially private mechanisms to perturb word embeddings, thus replacing individual words in a sentence. While these methods represent important contributions, have various advantages over other techniques, and do show anonymization capabilities, they have several shortcomings. In this paper, we investigate these weaknesses and demonstrate significant mathematical constraints that diminish the theoretical privacy guarantee, as well as major practical shortcomings with regard to protection against deanonymization attacks, preservation of the content of the original sentences, and the quality of the language output. Finally, we propose a new method for text anonymization, based on transformer-based language models fine-tuned for paraphrasing, that circumvents most of the identified weaknesses and also offers a formal privacy guarantee. We evaluate our method through thorough experimentation and demonstrate superior performance over the discussed mechanisms.