Modern neural language models widely used across NLP tasks risk memorizing sensitive information from their training data. As models continue to scale up in parameters, training data, and compute, understanding memorization in language models is important both from a learning-theoretical point of view and for real-world applications. An open question in previous studies of memorization in language models is how to filter out "common" memorization. In fact, most memorization criteria strongly correlate with the number of occurrences in the training set, capturing "common" memorization such as familiar phrases, public knowledge, or templated texts. In this paper, we provide a principled perspective inspired by a taxonomy of human memory in psychology. From this perspective, we formulate a notion of counterfactual memorization, which characterizes how a model's predictions change if a particular document is omitted during training. We identify and study counterfactually memorized training examples in standard text datasets. We further estimate the influence of each training example on the validation set and on generated texts, and show that this can provide direct evidence of the source of memorization at test time.
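The counterfactual quantities described above can be estimated by training many models on random subsets of the data and comparing per-example performance between models that did and did not see a given example. The sketch below illustrates this estimator on toy data; the array shapes, scores, and function names are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

# Toy setup (hypothetical): K models, each trained on a random half of
# N training examples; perf[k, i] is model k's score on training example i.
rng = np.random.default_rng(0)
K, N, M = 40, 100, 20
in_subset = rng.random((K, N)) < 0.5            # membership indicators
perf = rng.random((K, N)) + 0.3 * in_subset     # toy: +0.3 when seen in training
perf_val = rng.random((K, M))                   # scores on M validation examples

def counterfactual_memorization(perf, in_subset):
    """mem(x_i) = E[score on x_i | x_i in train] - E[score on x_i | x_i held out]."""
    mem = np.empty(perf.shape[1])
    for i in range(perf.shape[1]):
        mask = in_subset[:, i]
        mem[i] = perf[mask, i].mean() - perf[~mask, i].mean()
    return mem

def influence(perf_val, in_subset):
    """infl(x_i, v_j): effect of training example i on validation example j,
    estimated the same way, contrasting models that did vs. did not see x_i."""
    n_train = in_subset.shape[1]
    infl = np.empty((n_train, perf_val.shape[1]))
    for i in range(n_train):
        mask = in_subset[:, i]
        infl[i] = perf_val[mask].mean(axis=0) - perf_val[~mask].mean(axis=0)
    return infl

mem = counterfactual_memorization(perf, in_subset)   # one score per training example
infl = influence(perf_val, in_subset)                # (train example, val example) grid
```

On the toy data the memorization scores recover the planted +0.3 gap on average, while the influence estimates hover near zero because the validation scores are independent of subset membership; in practice, large influence values point to the training document responsible for a memorized test-time prediction.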