Recent advances in neural-network-based language models have led to successful deployments of such models, improving user experience in various applications. It has been demonstrated, however, that the strong performance of language models can come with the ability to memorize rare training samples, which poses serious privacy threats when models are trained on confidential user content. This necessitates privacy-monitoring techniques that minimize the chance of privacy breaches for models deployed in practice. In this work, we introduce a methodology for identifying the user content in the training data that could be leaked under a strong and realistic threat model. We propose two metrics that quantify user-level data leakage by measuring a model's ability to reproduce unique sentence fragments from its training data. Our metrics further enable comparing, in terms of privacy, different models trained on the same data. We demonstrate our approach through extensive numerical studies on real-world datasets such as email and forum conversations. We further illustrate how the proposed metrics can be used to investigate the efficacy of mitigations such as differentially private training and API hardening.