Latent Dirichlet Allocation (LDA) is a popular topic modeling technique for discovering hidden semantics in text data and serves as a fundamental tool for text analysis in various applications. However, both the trained LDA model and the training process may expose the text information in the training data, raising significant privacy concerns. To address the privacy issue in LDA, we systematically investigate the privacy protection of the mainstream LDA training algorithm based on Collapsed Gibbs Sampling (CGS) and propose several differentially private LDA algorithms for typical training scenarios. In particular, we present the first theoretical analysis of the inherent differential privacy guarantee of CGS-based LDA training and further propose a centralized privacy-preserving algorithm (HDP-LDA) that can prevent data inference from the intermediate statistics produced during CGS training. We also propose a locally private LDA training algorithm (LP-LDA) on crowdsourced data to provide local differential privacy for individual data contributors. Furthermore, we extend LP-LDA to an online version, OLP-LDA, to achieve LDA training on locally private mini-batches in a streaming setting. Extensive analysis and experimental results validate both the effectiveness and efficiency of our proposed privacy-preserving LDA training algorithms.
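To make the CGS training procedure referenced above concrete, the following is a minimal sketch of one collapsed Gibbs sweep for LDA with symmetric Dirichlet priors. The function name and count-array layout are illustrative assumptions, not the paper's implementation, and no explicit privacy mechanism is shown; the sampling randomness of the per-token conditional is the quantity that an inherent-privacy analysis of CGS would examine.

```python
import numpy as np

def cgs_sweep(docs, z, ndk, nkw, nk, alpha, beta, rng):
    """One sweep of collapsed Gibbs sampling for LDA (illustrative sketch).

    docs : list of word-id lists, one per document
    z    : list of topic-assignment lists, aligned with docs
    ndk  : (D, K) document-topic counts
    nkw  : (K, V) topic-word counts
    nk   : (K,)   per-topic token totals
    """
    K, V = nkw.shape
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k_old = z[d][i]
            # Remove the current token from all count statistics.
            ndk[d, k_old] -= 1
            nkw[k_old, w] -= 1
            nk[k_old] -= 1
            # Full conditional: p(z=k | rest) is proportional to
            # (ndk + alpha) * (nkw + beta) / (nk + V * beta).
            p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + V * beta)
            k_new = rng.choice(K, p=p / p.sum())
            # Add the token back under its newly sampled topic.
            ndk[d, k_new] += 1
            nkw[k_new, w] += 1
            nk[k_new] += 1
            z[d][i] = k_new
    return z, ndk, nkw, nk
```

In a typical run, topic assignments are initialized uniformly at random, the count arrays are tallied from them, and `cgs_sweep` is iterated until the sampler mixes (e.g., with `rng = np.random.default_rng(0)`). The intermediate count statistics `ndk` and `nkw` are exactly the quantities that HDP-LDA protects in the centralized setting, while LP-LDA and OLP-LDA instead perturb each contributor's data locally before any such statistics are aggregated.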