Latent Dirichlet Allocation (LDA) is a popular topic modeling technique for discovering the hidden semantic structure of text datasets, and it plays a fundamental role in many machine learning applications. However, like many other machine learning algorithms, training an LDA model may leak sensitive information from the training data and pose significant privacy risks. To mitigate these privacy issues, this paper studies privacy-preserving algorithms for LDA model training. In particular, we first develop a privacy monitoring algorithm to quantify the privacy guarantee obtained from the inherent randomness of the Collapsed Gibbs Sampling (CGS) process in a typical LDA training algorithm on centralized curated datasets. We then propose a locally private LDA training algorithm on crowdsourced data that provides local differential privacy for individual data contributors. Experimental results on real-world datasets demonstrate the effectiveness of the proposed algorithms.
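Since the first contribution hinges on the randomness of Collapsed Gibbs Sampling, the following minimal sketch shows the CGS update at the center of the analysis: each token's topic is resampled from its conditional distribution given all other assignments. The toy corpus, number of topics, and hyperparameter values are illustrative assumptions, not the paper's experimental setup.

```python
import random

random.seed(0)

# Toy corpus: each document is a list of word ids (illustrative, not the paper's data).
docs = [[0, 1, 2, 0, 1], [2, 3, 4, 4, 3], [0, 4, 1, 3, 2]]
V, K = 5, 2             # vocabulary size, number of topics (assumed values)
alpha, beta = 0.5, 0.1  # symmetric Dirichlet priors (assumed values)

# Count tables maintained by the collapsed sampler.
n_dk = [[0] * K for _ in docs]      # topic counts per document
n_kw = [[0] * V for _ in range(K)]  # word counts per topic
n_k = [0] * K                       # total tokens per topic
z = []                              # current topic assignment of every token

# Random initialization of topic assignments.
for d, doc in enumerate(docs):
    z.append([])
    for w in doc:
        k = random.randrange(K)
        z[d].append(k)
        n_dk[d][k] += 1
        n_kw[k][w] += 1
        n_k[k] += 1

def gibbs_sweep():
    """One full sweep: resample every token's topic from its conditional."""
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            # Remove the token's current assignment from the counts.
            n_dk[d][k] -= 1; n_kw[k][w] -= 1; n_k[k] -= 1
            # p(z = t | rest) ∝ (n_dk + alpha) * (n_kw + beta) / (n_k + V*beta)
            weights = [(n_dk[d][t] + alpha) * (n_kw[t][w] + beta) / (n_k[t] + V * beta)
                       for t in range(K)]
            k = random.choices(range(K), weights=weights)[0]
            z[d][i] = k
            n_dk[d][k] += 1; n_kw[k][w] += 1; n_k[k] += 1

for _ in range(50):
    gibbs_sweep()

# After burn-in, each row of n_kw estimates (up to normalization) a topic-word distribution.
print(n_kw)
```

The sampling step is an exponential-mechanism-like draw over topics, which is why its inherent randomness can be analyzed for a differential privacy guarantee; the locally private variant would instead perturb each contributor's data before any counts reach the server.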