The opaque nature and unexplained behavior of transformer-based language models (LMs) have spurred wide interest in interpreting their predictions. However, current interpretation methods mostly focus on probing models from the outside, executing behavioral tests, and analyzing the salience of input features, while the internal prediction construction process remains largely not understood. In this work, we introduce LM-Debugger, an interactive debugger tool for transformer-based LMs, which provides a fine-grained interpretation of the model's internal prediction process, as well as a powerful framework for intervening in LM behavior. For its backbone, LM-Debugger relies on a recent method that interprets the inner token representations and their updates by the feed-forward layers in the vocabulary space. We demonstrate the utility of LM-Debugger for single-prediction debugging by inspecting the internal disambiguation process performed by GPT2. Moreover, we show how easily LM-Debugger allows shifting model behavior in a direction of the user's choice, by identifying a few vectors in the network and inducing effective interventions in the prediction process. We release LM-Debugger as an open-source tool and a demo over GPT2 models.
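To make the backbone method concrete, the sketch below (not the authors' implementation; the layer and vector indices are arbitrary illustrations, and the Hugging Face transformers API is assumed) projects a single GPT2 feed-forward value vector into the vocabulary space through the output embedding matrix, reading off the tokens that vector promotes when added to the residual stream:

```python
# Minimal sketch, assuming Hugging Face transformers and tied GPT-2 embeddings:
# interpret one feed-forward "value vector" by projecting it onto the vocabulary.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Hypothetical choice of layer and inner dimension, for illustration only.
layer, dim = 10, 42

# In HF GPT-2, the FFN's second projection (c_proj) is a Conv1D whose weight
# has shape (4*d_model, d_model); row `dim` is the value vector added to the
# residual stream, scaled by that neuron's activation.
value_vector = model.transformer.h[layer].mlp.c_proj.weight[dim]  # (d_model,)

with torch.no_grad():
    # Project into vocabulary space via the (tied) output embedding matrix.
    logits = model.lm_head.weight @ value_vector  # (vocab_size,)
    top = torch.topk(logits, 10).indices

print([tokenizer.decode([i]) for i in top.tolist()])
```

Inspecting the top-scoring tokens of such projections is what lets a user read what each update "writes" toward the prediction, and amplifying or suppressing a few chosen value vectors is the kind of intervention the abstract refers to.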