Pre-trained language models can be surprisingly adept at tasks they were not explicitly trained on, but how they implement these capabilities is poorly understood. In this paper, we investigate the basic mathematical abilities often acquired by pre-trained language models. Concretely, we use mechanistic interpretability techniques to explain the (limited) mathematical abilities of GPT-2 small. As a case study, we examine its ability to take in sentences such as "The war lasted from the year 1732 to the year 17", and predict valid two-digit end years (years > 32). We first identify a circuit, a small subset of GPT-2 small's computational graph that computes this task's output. Then, we explain the role of each circuit component, showing that GPT-2 small's final multi-layer perceptrons boost the probability of end years greater than the start year. Finally, we show that our circuit generalizes to other tasks, playing a role in other greater-than scenarios.
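To make the task concrete, here is a minimal sketch of the greater-than probe described above, written with the Hugging Face `transformers` library. This is not the paper's evaluation code: the library choice, the restriction to two-digit year tokens, the renormalization step, and the `p_greater` metric are illustrative assumptions. If the paper's findings hold, most of the year probability mass on this prompt should land above 32.

```python
# Minimal sketch: probe GPT-2 small on the year-span prompt and measure
# how much probability it places on end years greater than the start year.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")  # GPT-2 small
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

prompt = "The war lasted from the year 1732 to the year 17"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits[0, -1]  # next-token logits
probs = torch.softmax(logits, dim=-1)

# GPT-2's BPE vocabulary represents each two-digit string "00".."99"
# as a single token, so a predicted end year is a single next token.
year_ids = []
for yy in range(100):
    ids = tokenizer.encode(f"{yy:02d}")
    assert len(ids) == 1  # each two-digit string should be one token
    year_ids.append(ids[0])

year_probs = probs[year_ids]
year_probs = year_probs / year_probs.sum()  # renormalize over year tokens

start = 32  # last two digits of the start year, 1732
p_greater = year_probs[start + 1:].sum().item()
print(f"P(end year > {start}) = {p_greater:.3f}")
```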