A Targeted Assessment of Incremental Processing in Neural Language Models and Humans

We present a targeted, scaled-up comparison of incremental processing in humans and neural language models by collecting by-word reaction time data for sixteen different syntactic test suites across a range of structural phenomena. Human reaction time data come from a novel online experimental paradigm called the Interpolated Maze task. We compare human reaction times to by-word probabilities for four contemporary language models with different architectures, trained on a range of data set sizes. We find that across many phenomena, both humans and language models show increased processing difficulty in ungrammatical sentence regions, with human and model "accuracy" scores (à la Marvin and Linzen (2018)) about equal. However, although language model outputs match humans in direction, we show that models systematically under-predict the difference in magnitude of incremental processing difficulty between grammatical and ungrammatical sentences. Specifically, when models encounter syntactic violations, they fail to accurately predict the longer reaction times observed in the human data. These results call into question whether contemporary language models are approaching human-like performance for sensitivity to syntactic violations.
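To make the two measures in the abstract concrete, the following is a minimal sketch, not the authors' pipeline. It assumes that by-word probabilities are converted to surprisal (negative log probability), that "accuracy" is the fraction of items where the ungrammatical critical region is harder than its grammatical counterpart, and that surprisal maps to reading time through a linear slope. All numbers, the function names, and the `MS_PER_BIT` slope are hypothetical values chosen for illustration only.

```python
import math


def surprisal(prob):
    """Surprisal in bits of a word, given its conditional probability."""
    return -math.log2(prob)


def accuracy(pairs):
    """Fraction of (grammatical, ungrammatical) pairs in which the
    ungrammatical value is larger, i.e. shows more processing difficulty."""
    return sum(1 for gram, ungram in pairs if ungram > gram) / len(pairs)


def mean_effect(pairs):
    """Mean difference (ungrammatical minus grammatical) across items."""
    return sum(ungram - gram for gram, ungram in pairs) / len(pairs)


# Hypothetical toy data: one (grammatical, ungrammatical) pair per item,
# measured at the critical word of each sentence.
model_probs = [(0.20, 0.05), (0.10, 0.08), (0.30, 0.40)]
human_rts_ms = [(800.0, 1400.0), (750.0, 1200.0), (900.0, 880.0)]

# Convert model probabilities to surprisal so that larger means harder.
model_surprisals = [(surprisal(g), surprisal(u)) for g, u in model_probs]

# Direction of the effect: both can agree item by item.
model_accuracy = accuracy(model_surprisals)
human_accuracy = accuracy(human_rts_ms)

# Magnitude of the effect: compare on a common scale by assuming a
# linear surprisal-to-milliseconds slope (hypothetical value).
MS_PER_BIT = 20.0
predicted_effect_ms = MS_PER_BIT * mean_effect(model_surprisals)
observed_effect_ms = mean_effect(human_rts_ms)

print(f"model accuracy: {model_accuracy:.2f}, human accuracy: {human_accuracy:.2f}")
print(f"predicted effect: {predicted_effect_ms:.1f} ms, "
      f"observed effect: {observed_effect_ms:.1f} ms")
```

In this toy setup the model and the human data agree on direction for the same items, yet the predicted slowdown is far smaller than the observed one, which is the kind of gap the abstract describes as under-prediction of magnitude.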
