Previous work has shown that Large Language Models are susceptible to so-called data extraction attacks, which allow an attacker to recover samples contained in the training data, with serious privacy implications. Constructing data extraction attacks is challenging: current attacks are quite inefficient, and there is a significant gap between the extraction capabilities of untargeted attacks and the measured memorization of these models. Targeted attacks have therefore been proposed, which determine whether a given sample from the training data is extractable from a model. In this work, we apply a targeted data extraction attack to the SATML2023 Language Model Training Data Extraction Challenge. Our approach has two steps. In the first step, we maximise the recall of the model and extract the correct suffix for 69% of the samples. In the second step, we apply a classifier-based Membership Inference Attack to the generations; our AutoSklearn classifier achieves a precision of 0.841. The full approach reaches a recall of 0.405 at a 10% false positive rate, an improvement of 34% over the baseline of 0.301.
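To make the two-step pipeline concrete, the following is a minimal illustrative sketch, not the authors' actual implementation: step 1 greedily decodes a candidate suffix for each prefix with the target language model, and step 2 computes simple likelihood features on the generation and feeds them to an AutoSklearn classifier acting as the membership inference attack. The model name, the 50-token suffix length, and the choice of features are assumptions made for illustration.

```python
# Illustrative sketch of a two-step targeted extraction pipeline (assumed details).
import numpy as np
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from autosklearn.classification import AutoSklearnClassifier

MODEL_NAME = "EleutherAI/gpt-neo-1.3B"  # assumption: target model of the challenge
SUFFIX_LEN = 50                          # assumption: suffix length in tokens

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME).eval()


@torch.no_grad()
def generate_suffix(prefix_ids: torch.Tensor) -> torch.Tensor:
    """Step 1: greedily decode a candidate suffix for one tokenized prefix."""
    out = model.generate(prefix_ids, max_new_tokens=SUFFIX_LEN, do_sample=False)
    return out[:, prefix_ids.shape[1]:]


@torch.no_grad()
def suffix_features(prefix_ids: torch.Tensor, suffix_ids: torch.Tensor) -> np.ndarray:
    """Step 2 features (assumed): mean and minimum token log-likelihood of the suffix."""
    ids = torch.cat([prefix_ids, suffix_ids], dim=1)
    logits = model(ids).logits[:, :-1]
    logprobs = torch.log_softmax(logits, dim=-1)
    targets = ids[:, 1:]
    token_lls = logprobs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    suffix_lls = token_lls[:, -suffix_ids.shape[1]:]
    return np.array([suffix_lls.mean().item(), suffix_lls.min().item()])


def fit_mia(features: np.ndarray, labels: np.ndarray) -> AutoSklearnClassifier:
    """Step 2: classifier-based MIA; labels mark whether a generation matched the true suffix."""
    clf = AutoSklearnClassifier(time_left_for_this_task=300)
    clf.fit(features, labels)
    return clf
```

In this sketch, thresholding the classifier's predicted membership probability would yield the recall/false-positive-rate trade-off reported in the abstract.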