Targeted syntactic evaluation of subject-verb number agreement in English (TSE) probes language models' syntactic knowledge using hand-crafted minimal pairs of sentences that differ only in the main verb's conjugation. The method checks whether a language model rates each grammatical sentence as more likely than its ungrammatical counterpart. We identify two distinct goals for TSE. First, evaluating the systematicity of a language model's syntactic knowledge: given a sentence, can it conjugate arbitrary verbs correctly? Second, evaluating a model's likely behavior: given a sentence, does the model concentrate its probability mass on correctly conjugated verbs, even if only on a subset of the possible verbs? We argue that current implementations of TSE do not directly capture either of these goals, and we propose new metrics to capture each goal separately. Under our metrics, we find that TSE overestimates the systematicity of language models, but that models score up to 40% better on verbs that they predict are likely in context.
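The distinction between the two goals can be illustrated with a small sketch. It is not the paper's implementation: the function names, the toy probabilities, and the exact form of each score are assumptions chosen only to show how a single hand-picked verb pair, an average over all verbs, and a probability-weighted score can disagree for the same context.

```python
from typing import Dict, Tuple

# Hypothetical toy numbers (not real model outputs): for one context such as
# "The keys to the cabinet ___", each lemma maps to the model's probability of
# the (correctly conjugated, incorrectly conjugated) verb form.
VerbProbs = Dict[str, Tuple[float, float]]


def pair_accuracy(probs: VerbProbs, lemma: str) -> float:
    """Standard minimal-pair check for one hand-picked verb:
    1.0 if the correct form is more probable than the incorrect one."""
    correct, incorrect = probs[lemma]
    return 1.0 if correct > incorrect else 0.0


def equally_weighted_score(probs: VerbProbs) -> float:
    """Systematicity sketch: every verb counts the same, however unlikely."""
    return sum(pair_accuracy(probs, lemma) for lemma in probs) / len(probs)


def probability_weighted_score(probs: VerbProbs) -> float:
    """Likely-behavior sketch: share of the probability mass over the
    candidate verb forms that falls on correct conjugations."""
    correct_mass = sum(c for c, _ in probs.values())
    total_mass = sum(c + i for c, i in probs.values())
    return correct_mass / total_mass


toy: VerbProbs = {
    "be": (0.30, 0.02),
    "have": (0.10, 0.01),
    "open": (0.004, 0.001),
    "exist": (0.0001, 0.0004),     # rare verb, conjugated incorrectly
    "jingle": (0.00001, 0.00003),  # rare verb, conjugated incorrectly
}

if __name__ == "__main__":
    print("single pair ('be'):  ", pair_accuracy(toy, "be"))
    print("equally weighted:    ", equally_weighted_score(toy))
    print("probability weighted:", round(probability_weighted_score(toy), 3))
```

In this toy setting the hand-picked pair looks perfect, the equally weighted score exposes the errors on rare verbs, and the probability-weighted score stays high because those verbs carry little probability mass. This mirrors the qualitative pattern the abstract describes, but the numbers themselves are invented for illustration.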