Transformer-based language models (LMs) continue to achieve state-of-the-art performance on natural language processing (NLP) benchmarks, including tasks designed to mimic human-inspired "commonsense" competencies. To better understand the degree to which LMs can be said to have certain linguistic reasoning skills, researchers have begun to adapt tools and concepts from psychometrics. But to what extent can the benefits flow in the other direction? In other words, can LMs be of use in predicting the psychometric properties of test items when those items are given to human participants? If so, the benefit for psychometric practitioners would be enormous, as it could reduce the need for multiple rounds of empirical testing. We gather responses from numerous human participants and LMs (transformer- and non-transformer-based) on a broad diagnostic test of linguistic competencies. We then calculate standard psychometric properties of the items in the diagnostic test, using the human responses and the LM responses separately, and determine how well the two sets of properties correlate. We find that transformer-based LMs predict the human psychometric data consistently well across most categories, suggesting that they can be used to gather human-like psychometric data without the need for extensive human trials.
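To make the analysis concrete, the sketch below illustrates one way such a comparison could be set up, assuming classical test theory statistics (item difficulty as proportion correct, discrimination as item-rest point-biserial correlation) and synthetic toy data; the function names, data shapes, and specific statistics are illustrative assumptions, not the authors' actual pipeline.

```python
import numpy as np
from scipy import stats

def item_properties(responses: np.ndarray) -> dict:
    """Per-item psychometric properties from a binary response matrix of
    shape (n_respondents, n_items): 1 = correct, 0 = incorrect."""
    difficulty = responses.mean(axis=0)      # proportion of respondents answering correctly
    total = responses.sum(axis=1)            # each respondent's total score
    # Discrimination: point-biserial correlation between an item and the
    # total score computed from the remaining items (item-rest correlation).
    discrimination = np.array([
        stats.pointbiserialr(responses[:, j], total - responses[:, j])[0]
        for j in range(responses.shape[1])
    ])
    return {"difficulty": difficulty, "discrimination": discrimination}

# Toy data: humans and LMs answer the same items, with shared item easiness
# driving the probability of a correct response (purely illustrative).
rng = np.random.default_rng(0)
n_items = 20
easiness = rng.uniform(0.25, 0.85, size=n_items)
human_responses = (rng.random((60, n_items)) < easiness).astype(int)
lm_responses = (rng.random((40, n_items)) < easiness).astype(int)

human_props = item_properties(human_responses)
lm_props = item_properties(lm_responses)

# How well do LM-derived item properties predict the human-derived ones?
for key in ("difficulty", "discrimination"):
    r, p = stats.pearsonr(human_props[key], lm_props[key])
    print(f"{key}: Pearson r = {r:.2f} (p = {p:.3f})")
```

In this toy setup, a high correlation for a given property would indicate that the LM responses recover the same item-level structure that the human responses do, which is the kind of agreement the abstract refers to.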