Privatized text rewriting with local differential privacy (LDP) is a recent approach that enables sharing of sensitive textual documents while formally guaranteeing privacy protection to individuals. However, existing systems face several issues, including formal mathematical flaws, unrealistic privacy guarantees, privatization of only individual words, and a lack of transparency and reproducibility. In this paper, we propose a new system, 'DP-BART', that largely outperforms existing LDP systems. Our approach uses a novel clipping method, iterative pruning, and further training of internal representations, which together drastically reduce the amount of noise required for DP guarantees. We run experiments on five textual datasets of varying sizes, rewriting them at different privacy guarantees and evaluating the rewritten texts on downstream text classification tasks. Finally, we thoroughly discuss the privatized text rewriting approach and its limitations, including the problem of the strict text adjacency constraint in the LDP paradigm, which leads to the high noise requirement.
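To make the clip-then-noise idea behind LDP text rewriting concrete, the following is a minimal, illustrative sketch rather than the authors' actual DP-BART implementation: it clips an internal vector representation coordinate-wise and adds Laplace noise calibrated to the resulting L1 sensitivity. The clipping bound C, the dimensionality, and the epsilon value below are hypothetical and chosen only for illustration; it also shows why the noise requirement grows with the representation size, which motivates the pruning discussed in the abstract.

```python
import numpy as np

def clip_and_noise(z: np.ndarray, C: float, epsilon: float) -> np.ndarray:
    """Clip each coordinate of z to [-C, C], then add Laplace noise.

    After clipping, the L1 distance between the representations of any two
    documents is at most 2 * C * d, so Laplace noise with scale
    2 * C * d / epsilon suffices for epsilon-LDP on the released vector.
    """
    d = z.shape[-1]
    z_clipped = np.clip(z, -C, C)      # bound the per-coordinate range, hence the sensitivity
    scale = 2.0 * C * d / epsilon      # Laplace scale calibrated to the L1 sensitivity
    noise = np.random.laplace(loc=0.0, scale=scale, size=z_clipped.shape)
    return z_clipped + noise

# Hypothetical 768-dimensional encoder representation of one document
z = np.random.randn(768)
z_private = clip_and_noise(z, C=0.1, epsilon=100.0)
```

Note how the noise scale is proportional to both C and d under this sketch, which is why reducing the effective dimensionality of the internal representation reduces the noise needed for a given privacy guarantee.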