How language models process complex inputs that require multiple steps of inference is not well understood. Previous research has shown that information about the intermediate values of such inputs can be extracted from model activations, but it is unclear where that information is encoded and whether it is actually used during inference. We introduce a method for analyzing how a Transformer model processes these inputs, focusing on simple arithmetic problems and their intermediate values. To trace where information about intermediate values is encoded, we measure the correlation between intermediate values and the model's activations using principal component analysis (PCA). We then perform a causal intervention by manipulating model weights. This intervention shows that the weights identified via tracing are not merely correlated with intermediate values but causally related to model predictions. Our findings show that information about certain intermediate values is localized within the model, which is useful for enhancing model interpretability.
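The tracing step can be illustrated with a minimal sketch: given activations collected from one layer while the model solves arithmetic problems, project them onto their principal components and correlate each component with the ground-truth intermediate value. All names here (`trace_intermediate_values`, `activations`, `intermediate_values`, `n_components`) are illustrative assumptions, not the paper's actual code.

```python
# A minimal sketch of PCA-based tracing, assuming we already have a matrix of
# hidden activations (one row per input problem) and each problem's
# ground-truth intermediate value, e.g. 7 for the "(3 + 4)" in "(3 + 4) * 5".
import numpy as np
from sklearn.decomposition import PCA

def trace_intermediate_values(activations, intermediate_values, n_components=10):
    """Correlate principal components of activations with intermediate values.

    activations:         (n_samples, hidden_dim) array from one layer/position.
    intermediate_values: (n_samples,) array of ground-truth intermediate results.
    Returns the absolute Pearson correlation of each component with the values.
    """
    pca = PCA(n_components=n_components)
    projected = pca.fit_transform(activations)  # (n_samples, n_components)
    correlations = np.array([
        np.corrcoef(projected[:, k], intermediate_values)[0, 1]
        for k in range(n_components)
    ])
    return np.abs(correlations)

# Example with random stand-in data; in practice, activations would be
# collected from a Transformer layer during inference.
rng = np.random.default_rng(0)
acts = rng.normal(size=(200, 512))
vals = rng.integers(0, 10, size=200).astype(float)
print(trace_intermediate_values(acts, vals))
```

Components with high correlation would then be candidate locations for the causal intervention on model weights described above.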