It has become common to publish large (billion-parameter) language models that have been trained on private datasets. This paper demonstrates that in such settings, an adversary can perform a training data extraction attack to recover individual training examples by querying the language model. We demonstrate our attack on GPT-2, a language model trained on scrapes of the public Internet, and are able to extract hundreds of verbatim text sequences from the model's training data. These extracted examples include (public) personally identifiable information (names, phone numbers, and email addresses), IRC conversations, code, and 128-bit UUIDs. Our attack is possible even though each of the above sequences is included in just one document in the training data. We comprehensively evaluate our extraction attack to understand the factors that contribute to its success. Worryingly, we find that larger models are more vulnerable than smaller models. We conclude by drawing lessons and discussing possible safeguards for training large language models.