How should we compare the capabilities of language models (LMs) and humans? I draw inspiration from comparative psychology to highlight some challenges. In particular, I consider a case study: processing of recursively nested grammatical structures. Prior work suggests that LMs cannot handle these structures as reliably as humans can. However, the humans were provided with instructions and training, while the LMs were evaluated zero-shot. I therefore match the evaluation more closely. Providing large LMs with a simple prompt -- substantially less content than the human training -- allows the LMs to consistently outperform the human results, and even to extrapolate to more deeply nested conditions than were tested with humans. Further, reanalyzing the prior human data suggests that the humans may not perform above chance at the difficult structures initially. Thus, large LMs may indeed process recursively nested grammatical structures as reliably as humans. This case study highlights how discrepancies in the evaluation can confound comparisons of language models and humans. I therefore reflect on the broader challenge of comparing human and model capabilities, and highlight an important difference between evaluating cognitive models and foundation models.