We investigate the mechanisms underlying factual knowledge recall in autoregressive transformer language models. First, we develop a causal intervention for identifying neuron activations capable of altering a model's factual predictions. Within large GPT-style models, this reveals two distinct sets of neurons that we hypothesize correspond to knowing an abstract fact and saying a concrete word, respectively. This insight inspires the development of ROME, a novel method for editing facts stored in model weights. For evaluation, we assemble CounterFact, a dataset of over twenty thousand counterfactuals, together with tools that facilitate sensitive measurements of knowledge editing. Using CounterFact, we confirm the distinction between knowing and saying neurons, and we find that ROME achieves state-of-the-art performance in knowledge editing compared to other methods. An interactive demo notebook, full code implementation, and the dataset are available at https://rome.baulab.info/.