Mechanistic interpretability aims to explain what a neural network has learned at a nuts-and-bolts level. What are the fundamental primitives of neural network representations? Previous mechanistic descriptions have used individual neurons or their linear combinations to understand the representations a network has learned. But there are clues that neurons and their linear combinations are not the correct fundamental units of description: directions cannot describe how neural networks use nonlinearities to structure their representations. Moreover, many instances of individual neurons and their combinations are polysemantic (i.e., they have multiple unrelated meanings). Polysemanticity makes interpreting the network in terms of neurons or directions challenging, since we can no longer assign a specific feature to a neural unit. In order to find a basic unit of description that does not suffer from these problems, we zoom in beyond just directions to study the way that piecewise linear activation functions (such as ReLU) partition the activation space into numerous discrete polytopes. We call this perspective the polytope lens. The polytope lens makes concrete predictions about the behavior of neural networks, which we evaluate through experiments on both convolutional image classifiers and language models. Specifically, we show that polytopes can be used to identify monosemantic regions of activation space (while directions are not in general monosemantic) and that the density of polytope boundaries reflects semantic boundaries. We also outline a vision for what mechanistic interpretability might look like through the polytope lens.
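As a minimal illustration of the partition described above (not code from the paper), the sketch below runs inputs through a small, randomly initialized two-layer ReLU network and records each unit's on/off pattern. Inputs that share the same pattern lie in the same polytope, on which the network acts as a single affine map; all weights and sizes here are arbitrary choices for the sake of the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy two-layer ReLU network with random weights (illustrative only).
W1, b1 = rng.normal(size=(16, 2)), rng.normal(size=16)
W2, b2 = rng.normal(size=(16, 16)), rng.normal(size=16)

def activation_pattern(x):
    """Return the binary on/off pattern of every ReLU for input x.

    Inputs sharing this pattern lie in the same polytope, where the
    network computes one and the same affine function of the input.
    """
    h1 = W1 @ x + b1
    h2 = W2 @ np.maximum(h1, 0) + b2
    return tuple((h1 > 0).astype(int)) + tuple((h2 > 0).astype(int))

# Sample inputs and count how many distinct polytopes they fall into.
xs = rng.normal(size=(1000, 2))
patterns = {activation_pattern(x) for x in xs}
print(f"{len(patterns)} distinct polytopes among {len(xs)} inputs")
```

Grouping activations by this pattern, rather than projecting them onto a single direction, is the basic move behind the polytope lens.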