When language models process syntactically complex sentences, do they use their representations of syntax in a manner that is consistent with the grammar of the language? We propose AlterRep, an intervention-based method to address this question. For any linguistic feature of a given sentence, AlterRep generates counterfactual representations by altering how the feature is encoded, while leaving intact all other aspects of the original representation. By measuring the change in a model's word prediction behavior when these counterfactual representations are substituted for the original ones, we can draw conclusions about the causal effect of the linguistic feature in question on the model's behavior. We apply this method to study how BERT models of different sizes process relative clauses (RCs). We find that BERT variants use RC boundary information during word prediction in a manner that is consistent with the rules of English grammar; this RC boundary information generalizes to a considerable extent across different RC types, suggesting that BERT represents RCs as an abstract linguistic category.
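The core of the intervention can be illustrated with a minimal, hypothetical sketch. The paper's actual method learns a multi-dimensional subspace encoding the feature with linear probes; here we assume a single probe direction `w` for simplicity, and flip the representation's component along it while leaving the orthogonal complement (everything else the model encodes) untouched. The function name `alter_rep` is illustrative, not the authors' implementation:

```python
import numpy as np

def alter_rep(h: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Counterfactual intervention sketch (single-direction assumption):
    reflect h across the hyperplane orthogonal to probe direction w.
    The component along w (the feature encoding) is negated; every
    component orthogonal to w is preserved exactly."""
    w_unit = w / np.linalg.norm(w)
    return h - 2.0 * (h @ w_unit) * w_unit

# Example: a toy hidden state and a toy probe direction.
h = np.array([1.0, 2.0, 3.0])
w = np.array([0.0, 1.0, 0.0])
h_cf = alter_rep(h, w)  # feature component along w is flipped
```

Substituting `h_cf` for `h` at a chosen layer and measuring the change in the model's word predictions is what licenses the causal interpretation: only the targeted feature's encoding differs between the two runs.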