Circuit graph discovery has emerged as a fundamental approach to elucidating the skill mechanisms of language models. Although circuit graphs are faithful to model outputs, they rely on atomic ablation, which discards the causal dependencies between connected components. In addition, their discovery process, designed to preserve output faithfulness, inadvertently captures extraneous effects beyond the isolated target skill. To alleviate these challenges, we introduce skill paths, which offer a more refined and compact representation that isolates an individual skill within a linear chain of components. To enable skill path extraction from circuit graphs, we propose a three-step framework consisting of decomposition, pruning, and post-pruning causal mediation. In particular, we derive a complete linear decomposition of the transformer model, which yields a disentangled computation graph. After pruning, we further apply causal analysis techniques, including counterfactuals and interventions, to extract the final skill paths from the circuit graph. To underscore the significance of skill paths, we use our framework to investigate three generic language skills: the Previous Token Skill, the Induction Skill, and the In-Context Learning Skill. Experiments support two crucial properties of these skills, namely stratification and inclusiveness.
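As a point of reference for the decomposition step, the sketch below shows one standard way a transformer layer can be written as a sum of component contributions to the residual stream; the notation ($x_\ell$ for the residual stream at layer $\ell$, $a^{h}_{\ell}$ for attention head $h$, $m_\ell$ for the MLP) is illustrative rather than taken from the paper, and layer normalization is omitted for brevity:
\[
x_{\ell+1} \;=\; \underbrace{x_\ell}_{\text{skip}} \;+\; \sum_{h} a^{h}_{\ell}(x_\ell) \;+\; m_\ell\!\Big(x_\ell + \sum_{h} a^{h}_{\ell}(x_\ell)\Big).
\]
Unrolling this recurrence across layers expresses the model output as a sum over linear chains of components; each chain is a candidate path in the disentangled computation graph, from which skill paths can then be selected by pruning and validated with counterfactual interventions.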