In Natural Language Understanding (NLU) applications, training an effective model often requires a massive amount of data. However, text data in the real world are scattered across different institutions and user devices. Directly sharing them with an NLU service provider poses serious privacy risks, as text data often contain sensitive information and may lead to privacy leakage. A typical way to protect privacy is to privatize the raw text directly and leverage Differential Privacy (DP) to quantify the level of protection. However, existing text privatization mechanisms that apply $d_{\mathcal{X}}$-privacy are not applicable to all similarity metrics and fail to achieve a good privacy-utility trade-off, primarily because (1) $d_{\mathcal{X}}$-privacy imposes strict requirements on the similarity metric, and (2) these mechanisms treat every input token equally. Such a poor privacy-utility trade-off impedes the adoption of current text privatization mechanisms in real-world applications. In this paper, we propose a Customized differentially private Text privatization mechanism (CusText) that assigns each input token a customized output set, providing adaptive privacy protection at the token level. CusText also overcomes the restriction on similarity metrics imposed by the $d_{\mathcal{X}}$-privacy notion by adapting the mechanism to satisfy $\epsilon$-DP instead. Furthermore, we provide two new text privatization strategies that boost the utility of privatized text without compromising privacy, and we design a new attack strategy to empirically evaluate the protection level of our mechanism from an attacker's perspective. Extensive experiments on two widely used datasets demonstrate that CusText achieves a better privacy-utility trade-off than existing methods and offers greater practical value.
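To make the token-level idea concrete, the sketch below shows one common way such an $\epsilon$-DP privatization step can be realized: each token is mapped to its own customized output set, and a replacement is sampled via the exponential mechanism with probability proportional to $\exp(\epsilon \cdot u(x,y) / (2\Delta u))$. This is a minimal illustration under assumed inputs; the `output_sets` mapping, the `score` function, and all identifiers are hypothetical stand-ins, not CusText's actual implementation.

```python
import math
import random

def privatize_token(token, output_set, score, epsilon, sensitivity=1.0):
    """Sample a replacement for `token` from its customized output set
    via the exponential mechanism: Pr[y] is proportional to
    exp(epsilon * score(token, y) / (2 * sensitivity)), which satisfies
    epsilon-DP over the candidate set when |score| <= sensitivity."""
    weights = [math.exp(epsilon * score(token, y) / (2.0 * sensitivity))
               for y in output_set]
    return random.choices(output_set, weights=weights, k=1)[0]

def privatize_text(tokens, output_sets, score, epsilon):
    """Privatize a sequence token by token; each token is replaced only
    from its own customized output set, never the whole vocabulary."""
    return [privatize_token(t, output_sets[t], score, epsilon) for t in tokens]

# Toy usage with hypothetical output sets and a dummy similarity score
# (a stand-in for, e.g., embedding-based similarity).
output_sets = {
    "good": ["good", "great", "fine"],
    "movie": ["movie", "film", "show"],
}
score = lambda x, y: 1.0 if x == y else 0.0
print(privatize_text(["good", "movie"], output_sets, score, epsilon=2.0))
```

A smaller $\epsilon$ flattens the sampling distribution (stronger privacy, lower utility), while a larger $\epsilon$ concentrates mass on high-similarity candidates, which is the trade-off the abstract refers to.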