Transformer-based language models (LMs) continue to advance state-of-the-art performance on NLP benchmark tasks, including tasks designed to mimic human-inspired "commonsense" competencies. To better understand the degree to which LMs can be said to have certain linguistic reasoning skills, researchers are beginning to adapt the tools and concepts of psychometrics. But to what extent can the benefits flow in the other direction? That is, can LMs help predict the psychometric properties that test items will exhibit when those items are given to human participants? We gather responses to a broad diagnostic test of linguistic competencies from numerous human participants and from a range of LMs, both transformer-based and not. We then use these responses to calculate standard psychometric properties of the test items, once from the human responses and once from the LM responses, and measure how well the two sets of estimates agree. We find that transformer-based LMs predict psychometric properties consistently well in certain categories but consistently poorly in others, providing new insights into fundamental similarities and differences between human and LM reasoning.
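As an illustrative sketch (not the paper's actual code), the comparison described above can be framed in classical test theory terms: from a binary response matrix one computes each item's difficulty (proportion answering correctly) and discrimination (point-biserial correlation of the item with the total score), once for human respondents and once for LM respondents, and then checks how well the two sets of item properties agree. The response matrices below are hypothetical.

```python
# Sketch of classical test theory item statistics from binary response
# matrices, assuming rows are respondents and columns are test items.
import numpy as np

def item_statistics(responses):
    """Return per-item difficulty and discrimination.

    responses: 2D 0/1 array, shape (n_respondents, n_items).
    Difficulty is the proportion of correct answers; discrimination is
    the point-biserial correlation of each item with the total score.
    """
    responses = np.asarray(responses, dtype=float)
    difficulty = responses.mean(axis=0)          # proportion correct per item
    total = responses.sum(axis=1)                # each respondent's total score
    # Point-biserial = Pearson correlation of item score with total score.
    discrimination = np.array([
        np.corrcoef(responses[:, j], total)[0, 1]
        for j in range(responses.shape[1])
    ])
    return difficulty, discrimination

# Hypothetical response matrices for the same 4 test items.
human = np.array([[1, 1, 0, 1],
                  [1, 0, 0, 1],
                  [0, 1, 1, 1],
                  [1, 1, 0, 0],
                  [1, 0, 1, 1]])
lm = np.array([[1, 1, 0, 1],
               [1, 1, 1, 0],
               [0, 0, 0, 1]])

h_diff, h_disc = item_statistics(human)
l_diff, l_disc = item_statistics(lm)
# How well do LM-derived item difficulties track human-derived ones?
agreement = np.corrcoef(h_diff, l_diff)[0, 1]
```

A high correlation between the human- and LM-derived item properties would indicate that the LM is a useful proxy for human test-takers on that category of items; the paper's finding is that this agreement is strong for some linguistic categories and weak for others.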