When an LLM learns a new fact during finetuning (e.g., a new movie release or a newly elected pope), where does this information go? Are entities enriched with relation information, or do models recall information just-in-time before a prediction? Or is ``all of the above'' true, with LLMs implementing multiple redundant heuristics? Existing localization approaches (e.g., activation patching) are ill-suited for this analysis because they usually \textit{replace} parts of the residual stream, thus overwriting previous information. To fill this gap, we propose \emph{dynamic weight grafting}, a technique that selectively grafts weights from a finetuned model onto a pretrained model. Using this technique, we show two separate pathways for retrieving finetuned relation information: 1) ``enriching'' the residual stream with relation information while processing the tokens that correspond to an entity (e.g., ``Zendaya'' in ``Zendaya co-starred with John David Washington'') and 2) ``recalling'' this information at the final token position before generating a target fact. In some cases, models need information from both of these pathways to correctly generate finetuned facts, while in other cases either the ``enrichment'' or the ``recall'' pathway alone is sufficient. We localize the ``recall'' pathway to model components, finding that ``recall'' occurs via both task-specific attention mechanisms and an entity-specific extraction step in the feedforward networks of the final layers before the target prediction. By targeting model components and parameters, as opposed to just activations, we are able to understand the \textit{mechanisms} by which finetuned knowledge is retrieved during generation.
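The core idea of choosing, per token position and per component, whether the pretrained or the finetuned weights are used can be illustrated with a minimal sketch. This is a toy under stated assumptions, not the implementation used in this work: the "model" is a stack of position-wise linear maps with no attention, and the names `grafted_forward`, `graft_positions`, and `graft_components` are hypothetical.

```python
from typing import Dict, List, Set

Matrix = List[List[float]]
Weights = Dict[str, Matrix]  # component name -> weight matrix (toy stand-in)


def matvec(m: Matrix, v: List[float]) -> List[float]:
    """Multiply matrix m by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def grafted_forward(
    pretrained: Weights,
    finetuned: Weights,
    inputs: List[List[float]],
    graft_positions: Set[int],
    graft_components: Set[str],
) -> List[List[float]]:
    """Run the toy model, swapping in finetuned weights only where requested.

    A component uses finetuned weights at a position only if the position is
    in graft_positions AND the component is in graft_components; everywhere
    else the pretrained weights are used.
    """
    outputs = []
    for pos, x in enumerate(inputs):
        h = x
        for name in pretrained:  # components are applied in insertion order
            use_finetuned = pos in graft_positions and name in graft_components
            w = finetuned[name] if use_finetuned else pretrained[name]
            h = matvec(w, h)
        outputs.append(h)
    return outputs


# Toy weights (assumed values): pretrained layers are identities,
# finetuned layers scale by 2 and 3 respectively.
pretrained = {
    "layer0": [[1.0, 0.0], [0.0, 1.0]],
    "layer1": [[1.0, 0.0], [0.0, 1.0]],
}
finetuned = {
    "layer0": [[2.0, 0.0], [0.0, 2.0]],
    "layer1": [[3.0, 0.0], [0.0, 3.0]],
}
tokens = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]

# Graft only the last layer at the final position (a "recall"-style probe).
recall_only = grafted_forward(pretrained, finetuned, tokens, {2}, {"layer1"})

# Graft every component at the first position (an "enrichment"-style probe).
enrich_only = grafted_forward(
    pretrained, finetuned, tokens, {0}, {"layer0", "layer1"}
)
```

Because this toy has no attention, grafting at one position cannot affect any other position. In a real transformer, grafted weights at entity positions can change what later positions attend to, which is what makes it possible to study the two pathways separately.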