Common methods for interpreting neural models in natural language processing typically examine either their structure or their behavior, but not both. We propose a methodology grounded in the theory of causal mediation analysis for interpreting which parts of a model are causally implicated in its behavior. It enables us to analyze the mechanisms by which information flows from input to output through various model components, known as mediators. We apply this methodology to analyze gender bias in pre-trained Transformer language models. We study the role of individual neurons and attention heads in mediating gender bias across three datasets designed to gauge a model's sensitivity to gender bias. Our mediation analysis reveals that gender bias effects are (i) sparse, concentrated in a small part of the network; (ii) synergistic, amplified or repressed by different components; and (iii) decomposable into effects flowing directly from the input and indirectly through the mediators.
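The decomposition described above (effects flowing directly from the input and indirectly through mediators) can be illustrated with a minimal sketch. This is a hypothetical toy model, not the paper's implementation: `mediator` stands in for a model component such as a neuron or attention head, and the direct/indirect effects are computed by intervening on the input while holding the mediator fixed, or vice versa.

```python
# Toy sketch of causal mediation analysis (illustrative only, not the paper's code).
# A "model" maps input x to output y both directly and through a mediator m.
# We measure how much of the total effect of an input intervention flows
# through the mediator (indirect effect) versus around it (direct effect).

def mediator(x, m_override=None):
    """Hypothetical mediator component; its value can be fixed by intervention."""
    return 2 * x if m_override is None else m_override

def model(x, m_override=None):
    """Toy model: the output depends on x directly and via the mediator."""
    m = mediator(x, m_override)
    return x + 3 * m  # direct path (x) plus mediated path (3 * m)

x, x_prime = 1.0, 2.0  # original input vs. intervened (counterfactual) input

# Total effect of the intervention on the output.
total = model(x_prime) - model(x)

# Direct effect: change the input, but hold the mediator at its original value.
direct = model(x_prime, m_override=mediator(x)) - model(x)

# Indirect effect: keep the input, but set the mediator to its intervened value.
indirect = model(x, m_override=mediator(x_prime)) - model(x)

print(total, direct, indirect)
```

Because this toy model is linear, the decomposition is exact (`total == direct + indirect`); in a real Transformer the components interact, which is what the abstract's "synergistic" finding refers to.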