With the wide availability of large pre-trained language models such as GPT-2 and BERT, the recent trend has been to fine-tune a pre-trained model to achieve state-of-the-art performance on a downstream task. One natural example is the "Smart Reply" application, where a pre-trained model is tuned to provide suggested responses for a given query message. Since these models are often tuned using sensitive data such as emails or chat transcripts, it is important to understand and mitigate the risk that the model leaks its tuning data. We investigate potential information leakage vulnerabilities in a typical Smart Reply pipeline and introduce a new type of active extraction attack that exploits canonical patterns in text containing sensitive data. We show experimentally that it is possible for an adversary to extract sensitive user information present in the training data. We explore potential mitigation strategies and demonstrate empirically how differential privacy can serve as an effective defense mechanism against such pattern extraction attacks.