Text data has become extremely valuable for large language models (LLMs) and may even pave the way toward artificial general intelligence (AGI). However, much high-quality real-world text is private and cannot be freely used due to privacy concerns. Differentially private (DP) synthetic text generation has therefore been proposed, aiming to produce high-utility synthetic data while protecting sensitive information. However, existing DP synthetic text generation methods impose uniform privacy guarantees that often overprotect non-sensitive content, resulting in substantial utility loss and computational overhead. To address this, we propose Secret-Protected Evolution (SecPE), a novel framework that extends private evolution with secret-aware protection. Theoretically, we show that SecPE satisfies $(\mathrm{p}, \mathrm{r})$-secret protection, a relaxation of Gaussian differential privacy (GDP) that enables tighter utility-privacy trade-offs, while also substantially reducing computational complexity relative to baseline methods. Empirically, across the OpenReview, PubMed, and Yelp benchmarks, SecPE consistently achieves lower Fr\'echet Inception Distance (FID) and higher downstream task accuracy than GDP-based Aug-PE baselines, while requiring less noise to attain the same level of protection. Our results highlight that secret-aware guarantees can unlock more practical and effective privacy-preserving synthetic text generation.