Mechanistic interpretability aims to understand how models store representations by decomposing neural networks into interpretable units. However, polysemantic neurons, which respond to multiple unrelated features, make interpreting individual neurons challenging. This has motivated the search for meaningful directions in activation space, known as concept vectors, rather than individual neurons. The main contribution of this paper is a method that disentangles polysemantic neurons into concept vectors, each encapsulating a distinct feature. Our method can search for fine-grained concepts according to the user's desired level of concept separation. Our analysis shows that polysemantic neurons can be decomposed into directions that are linear combinations of neurons, and our evaluations show that the resulting concept vectors encode coherent, human-understandable features.
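To make the notion of a concept vector concrete, the sketch below treats one as a unit-norm direction in a layer's activation space (a linear combination of neurons) and scores inputs by projecting their activations onto it. The random activations, layer size, and concept direction are placeholders for illustration only, not the method proposed in this paper.

```python
import numpy as np

# Illustrative sketch only: a "concept vector" is a direction in a layer's
# activation space, i.e. a linear combination of neurons.
rng = np.random.default_rng(0)
num_samples, num_neurons = 512, 256          # hypothetical layer size
activations = rng.standard_normal((num_samples, num_neurons))  # placeholder layer activations

# A hypothetical concept vector: weights over neurons, normalized to unit length.
concept_vector = rng.standard_normal(num_neurons)
concept_vector /= np.linalg.norm(concept_vector)

# Projecting activations onto the direction gives one scalar "concept activation"
# per input; a single polysemantic neuron would instead mix several such signals.
concept_scores = activations @ concept_vector
top_inputs = np.argsort(concept_scores)[-10:]  # inputs that most express the concept
print(top_inputs)
```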