We describe an "interpretability illusion" that arises when analyzing the BERT model. Activations of individual neurons in the network may spuriously appear to encode a single, simple concept, when in fact they are encoding something far more complex. The same effect holds for linear combinations of activations. We trace the source of this illusion to geometric properties of BERT's embedding space as well as the fact that common text corpora represent only narrow slices of possible English sentences. We provide a taxonomy of model-learned concepts and discuss methodological implications for interpretability research, especially the importance of testing hypotheses on multiple data sets.
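To make the failure mode concrete, here is a minimal sketch (not the paper's actual experimental code) of the "top-activating examples" probe the abstract cautions about: rank sentences by a single neuron's activation, then repeat the ranking on a second corpus. The layer and neuron indices, model choice (`bert-base-uncased` via HuggingFace `transformers`), and toy corpora are all illustrative assumptions.

```python
# Sketch: probing one BERT neuron by its top-activating sentences on two
# corpora. If the neuron truly encoded a single simple concept, the top
# examples should share that concept across corpora; divergent patterns
# on a new corpus are a sign of the interpretability illusion.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
model.eval()

LAYER, NEURON = 8, 317  # arbitrary illustrative choices, not from the paper

def neuron_activation(sentence: str) -> float:
    """Mean activation of one hidden-state dimension over the sentence's tokens."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).hidden_states[LAYER]  # shape (1, seq_len, 768)
    return hidden[0, :, NEURON].mean().item()

def top_sentences(corpus, k=3):
    """Return the k sentences on which the chosen neuron fires most strongly."""
    return sorted(corpus, key=neuron_activation, reverse=True)[:k]

# Two small, deliberately different corpora (hypothetical examples).
corpus_a = ["The senate passed the bill.", "Stocks fell sharply today.",
            "The cat slept on the mat.", "Parliament debated the motion."]
corpus_b = ["Mix the flour and sugar.", "The patient's fever subsided.",
            "Bake at 350 degrees.", "The committee approved the budget."]

print("Corpus A top:", top_sentences(corpus_a))
print("Corpus B top:", top_sentences(corpus_b))
```

The same probe applies to a linear combination of activations by replacing the single index with a learned direction vector; the cross-corpus comparison is the methodological point the abstract emphasizes.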