Studying data memorization in neural language models helps us understand the risks (e.g., to privacy or copyright) associated with models regurgitating training data, and aids in the evaluation of potential countermeasures. Many prior works -- and some recently deployed defenses -- focus on "verbatim memorization", defined as a model generation that exactly matches a substring from the training set. We argue that verbatim memorization definitions are too restrictive and fail to capture more subtle forms of memorization. Specifically, we design and implement an efficient defense based on Bloom filters that perfectly prevents all verbatim memorization. And yet, we demonstrate that this "perfect" filter does not prevent the leakage of training data. Indeed, it is easily circumvented by plausible and minimally modified "style-transfer" prompts -- and in some cases even by the unmodified original prompts -- which still extract memorized information. For example, instructing the model to output ALL-CAPITAL text bypasses memorization checks based on verbatim matching. We conclude by discussing potential alternative definitions and why defining memorization is a difficult yet crucial open question for neural language models.
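To make the verbatim-matching defense and its weakness concrete, below is a minimal sketch (not the paper's implementation) of a Bloom-filter check over training-set n-grams, together with the ALL-CAPS "style-transfer" bypass described above. The n-gram length, hash count, and helper names are illustrative assumptions.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter over byte strings (illustrative only)."""

    def __init__(self, num_bits: int = 1 << 20, num_hashes: int = 4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, item: bytes):
        # Derive several bit positions from salted SHA-256 digests.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(bytes([seed]) + item).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: bytes):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: bytes) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_filter(training_texts, n=5):
    """Insert every length-n token window of the training set into the filter."""
    bf = BloomFilter()
    for text in training_texts:
        for gram in ngrams(text.split(), n):
            bf.add(" ".join(gram).encode())
    return bf


def is_verbatim_leak(generation, bf, n=5):
    """Flag a generation iff some n-gram exactly matches a training n-gram."""
    return any(" ".join(g).encode() in bf for g in ngrams(generation.split(), n))


if __name__ == "__main__":
    training = ["the quick brown fox jumps over the lazy dog"]
    bf = build_filter(training, n=5)

    verbatim = "the quick brown fox jumps over the lazy dog"
    shouting = verbatim.upper()  # "style-transfer": an ALL-CAPS rewrite

    print(is_verbatim_leak(verbatim, bf))   # True  -> blocked by the verbatim filter
    print(is_verbatim_leak(shouting, bf))   # False -> same content slips through
```

The second check illustrates the paper's central point: because the filter tests only exact n-gram matches, any surface-level rewrite of memorized text (here, uppercasing) evades the defense while leaking the same information.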