Localizing behaviors of neural networks to a subset of the network's components, or to a subset of interactions between components, is a natural first step towards analyzing network mechanisms and possible failure modes. Existing work is often qualitative and ad hoc, and there is no consensus on the appropriate way to evaluate localization claims. We introduce path patching, a technique for expressing and quantitatively testing a natural class of hypotheses which state that behaviors are localized to a set of paths. We refine an explanation of induction heads, characterize a behavior of GPT-2, and open-source a framework for efficiently running similar experiments.