How should we compare the capabilities of language models and humans? Here, I consider a case study: processing of recursively nested grammatical structures. Prior work has suggested that language models cannot handle these structures as reliably as humans can. However, the humans were provided with instructions and training before being evaluated, while the language models were evaluated zero-shot. I therefore attempt to more closely match the evaluation paradigms by providing language models with few-shot prompts. A simple prompt, which contains substantially less content than the human training, allows large language models to consistently outperform the human results. The same prompt even allows extrapolation to more deeply nested conditions than have been tested in humans. Further, a reanalysis of the prior human experiments suggests that the humans may not perform above chance at the difficult structures initially. These results suggest that large language models can in fact process recursively nested grammatical structures comparably to humans. This case study highlights how discrepancies in the quantity of experiment-specific context can confound comparisons of language models and humans. I use this case study to reflect on the broader challenge of comparing human and model capabilities, and to suggest that there is an important difference between evaluating cognitive models of a specific phenomenon and evaluating broadly-trained models.