Language Models (LMs) have been shown to leak information about training data through sentence-level membership inference and reconstruction attacks. Understanding the risk of LMs leaking Personally Identifiable Information (PII) has received less attention, which can be attributed to the false assumption that dataset curation techniques such as scrubbing are sufficient to prevent PII leakage. Scrubbing techniques reduce but do not prevent the risk of PII leakage: in practice, scrubbing is imperfect and must balance the trade-off between minimizing disclosure and preserving the utility of the dataset. On the other hand, it is unclear to what extent algorithmic defenses such as differential privacy, designed to guarantee sentence- or user-level privacy, prevent PII disclosure. In this work, we introduce rigorous game-based definitions for three types of PII leakage via black-box extraction, inference, and reconstruction attacks with only API access to an LM. We empirically evaluate the attacks against GPT-2 models fine-tuned with and without defenses on three domains: case law, health care, and e-mails. Our main contributions are (i) novel attacks that can extract up to 10$\times$ more PII sequences than existing attacks, (ii) showing that sentence-level differential privacy reduces the risk of PII disclosure but still leaks about 3% of PII sequences, and (iii) demonstrating a subtle connection between record-level membership inference and PII reconstruction.
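To make the threat model concrete, the sketch below shows what a generic black-box PII extraction loop could look like: repeatedly sample text from a language model through its generation API and tag PII-like spans (here, person names) with an off-the-shelf NER pipeline. This is an illustrative sketch only, not the paper's exact attack; the model name, sampling parameters, and NER tagger are placeholder assumptions.

```python
# Illustrative sketch of black-box PII extraction: sample from an LM and
# collect spans tagged as person names. Model names, sampling parameters,
# and the NER tagger are placeholders, not the paper's actual setup.
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model_name = "gpt2"  # stand-in for a fine-tuned checkpoint (hypothetical)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
ner = pipeline("ner", aggregation_strategy="simple")  # default NER model


def extract_pii_candidates(num_samples: int = 10, max_new_tokens: int = 64):
    """Sample text from the LM and collect strings tagged as person names."""
    candidates = set()
    for _ in range(num_samples):
        # Prompt with the BOS token so each sample starts from scratch.
        inputs = tokenizer(tokenizer.bos_token, return_tensors="pt")
        output = model.generate(
            **inputs,
            do_sample=True,          # top-k sampling so repeated runs differ
            top_k=40,
            max_new_tokens=max_new_tokens,
            pad_token_id=tokenizer.eos_token_id,
        )
        text = tokenizer.decode(output[0], skip_special_tokens=True)
        for entity in ner(text):
            if entity["entity_group"] == "PER":  # keep person-name spans only
                candidates.add(entity["word"].strip())
    return candidates


if __name__ == "__main__":
    print(extract_pii_candidates())
```

The point of the sketch is that such an attack needs nothing beyond sampling access to the model, which is exactly the API-only setting the abstract describes.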