Individual neurons in neural networks often represent a mixture of unrelated features. This phenomenon, called polysemanticity, can make neural networks harder to interpret, so we aim to understand its causes. We propose doing so through the lens of feature \emph{capacity}, the fractional dimension each feature consumes in the embedding space. We show that in a toy model the optimal capacity allocation tends to represent the most important features monosemantically, represent less important features polysemantically (in proportion to their impact on the loss), and ignore the least important features entirely. Polysemanticity is more prevalent when the inputs have higher kurtosis or sparsity, and it is more prevalent in some architectures than in others. Given an optimal allocation of capacity, we go on to study the geometry of the embedding space. We find a block-semi-orthogonal structure, with block sizes that differ between models, highlighting the impact of model architecture on the interpretability of its neurons.
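To make the notion of capacity concrete, the sketch below computes a per-feature "fractional dimension" from a matrix of feature embedding vectors. It assumes capacity is measured as $C_i = (W_i \cdot W_i)^2 / \sum_j (W_i \cdot W_j)^2$, where $W_i$ is the embedding vector of feature $i$; this formula is not stated in the abstract and is an assumption made here for illustration, as are the function name `feature_capacity` and the example matrices.

```python
import numpy as np

def feature_capacity(W):
    """Per-feature capacity under the assumed ratio definition.

    W has shape (embedding_dim, n_features); column i is the embedding
    vector of feature i. Returns an array of n_features values in [0, 1].
    """
    W = np.asarray(W, dtype=float)
    gram = W.T @ W                      # pairwise dot products W_i . W_j
    self_sq = np.diag(gram) ** 2        # (W_i . W_i)^2
    total_sq = (gram ** 2).sum(axis=1)  # sum_j (W_i . W_j)^2
    # Features with a zero embedding are ignored by the model: capacity 0.
    safe_total = np.where(total_sq > 0, total_sq, 1.0)
    return np.where(total_sq > 0, self_sq / safe_total, 0.0)

# Two orthogonal features in two dimensions: each gets a full dimension.
monosemantic = feature_capacity([[1.0, 0.0],
                                 [0.0, 1.0]])

# Two features sharing a single dimension: each gets half of it.
polysemantic = feature_capacity([[1.0, 1.0]])

# Third feature has a zero embedding, so it consumes no capacity.
ignored = feature_capacity([[1.0, 0.0, 0.0],
                            [0.0, 1.0, 0.0]])
```

Under this assumed definition, the three cases line up with the three regimes described in the abstract: a capacity of 1 corresponds to a monosemantically represented feature, a value strictly between 0 and 1 to a polysemantically represented one, and 0 to an ignored feature.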