We revisit the problem of $n$-gram extraction in the differential privacy setting. In this problem, given a corpus of private text data, the goal is to release as many $n$-grams as possible while preserving user-level privacy. Extracting $n$-grams is a fundamental subroutine in many NLP applications, such as sentence completion and response generation for emails. The problem also arises in other applications, such as sequence mining, and is a generalization of the recently studied differentially private set union (DPSU) problem. In this paper, we develop a new differentially private algorithm for this problem which, in our experiments, significantly outperforms the state of the art. Our improvements stem from combining recent advances in DPSU and privacy accounting with new heuristics for pruning in the tree-based approach initiated by Chen et al. (2012).
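To make the problem statement concrete, the following is a minimal sketch of a generic count-and-threshold baseline for releasing $n$-grams under user-level contribution bounds. It is not the algorithm developed in this paper. The function names, the parameter `max_per_user`, and the choice of Laplace noise are assumptions made for illustration, and `threshold` is taken as an input: it must be calibrated to the target $(\varepsilon, \delta)$ for the release to be differentially private, and that calibration is not shown here.

```python
import random
from collections import Counter


def extract_ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def noisy_threshold_release(user_texts, n, max_per_user, epsilon, threshold, rng):
    """Illustrative baseline (hypothetical helper, not the paper's method).

    user_texts: one string per user, holding all of that user's text.
    Each user adds 1 to the count of at most `max_per_user` distinct
    n-grams, so one user changes the histogram by at most `max_per_user`
    in L1 norm. Laplace noise with scale max_per_user / epsilon is added
    to every nonzero count, and only n-grams whose noisy count exceeds
    `threshold` are released.
    """
    counts = Counter()
    for text in user_texts:
        grams = sorted(set(extract_ngrams(text.split(), n)))
        rng.shuffle(grams)  # pick the user's contributions at random
        for gram in grams[:max_per_user]:
            counts[gram] += 1

    scale = max_per_user / epsilon
    released = []
    for gram, count in sorted(counts.items()):
        # The difference of two i.i.d. exponentials is Laplace-distributed.
        noise = rng.expovariate(1 / scale) - rng.expovariate(1 / scale)
        if count + noise > threshold:
            released.append(gram)
    return released


if __name__ == "__main__":
    users = ["the cat sat"] * 50 + ["a rare phrase"]
    rng = random.Random(0)
    print(noisy_threshold_release(users, n=2, max_per_user=2,
                                  epsilon=10.0, threshold=10.0, rng=rng))
```

In this toy run, bigrams shared by many users clear the threshold while bigrams seen by a single user are suppressed. The sketch treats every $n$-gram independently and does not exploit the prefix structure that the tree-based approach relies on.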