Model interpretability methods are often used to explain NLP model decisions on tasks such as text classification, where the output space is relatively small. However, when applied to language generation, where the output space often consists of tens of thousands of tokens, these methods are unable to provide informative explanations. Language models must consider various features to predict a token, such as its part of speech, number, tense, or semantics. Existing explanation methods conflate evidence for all these features into a single explanation, making them less interpretable to humans. To disentangle the different decisions in language modeling, we focus on explaining language models contrastively: we look for salient input tokens that explain why the model predicted one token instead of another. We demonstrate that contrastive explanations are quantifiably better than non-contrastive explanations in verifying major grammatical phenomena, and that they significantly improve contrastive model simulatability for human observers. We also identify groups of contrastive decisions where the model uses similar evidence, and we are able to characterize which input tokens models use during various language generation decisions.
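To make the core idea concrete, the following is a minimal sketch (not the paper's exact method) of one way to compute a contrastive, gradient-times-input saliency: instead of attributing the logit of the predicted token alone, we attribute the difference between the logit of a target token and that of a foil token, so the resulting scores answer "why this token rather than that one?". The model name, prompt, and target/foil tokens below are illustrative assumptions.

```python
# Contrastive gradient-times-input saliency sketch using a small GPT-2 model.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

prompt = "Can you stop the dog from"
target, foil = " barking", " crying"  # explain " barking" instead of " crying"

input_ids = tokenizer(prompt, return_tensors="pt").input_ids
target_id = tokenizer(target).input_ids[0]
foil_id = tokenizer(foil).input_ids[0]

# Embed the inputs manually so gradients can flow to the input embeddings.
embeds = model.transformer.wte(input_ids).detach().requires_grad_(True)
logits = model(inputs_embeds=embeds).logits[0, -1]  # next-token logits

# Contrastive objective: logit(target) - logit(foil).
(logits[target_id] - logits[foil_id]).backward()

# Per-token saliency: gradient times input, summed over embedding dimensions.
saliency = (embeds.grad[0] * embeds[0]).sum(-1)
for tok, score in zip(tokenizer.convert_ids_to_tokens(input_ids[0]),
                      saliency.tolist()):
    print(f"{tok!r}: {score:+.3f}")
```

Tokens with large positive scores are those the model relies on to prefer the target over the foil; a non-contrastive explanation would instead attribute the target logit alone and mix evidence for many unrelated features of the prediction.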