We introduce TurBLiMP, the first Turkish benchmark of linguistic minimal pairs, designed to evaluate the linguistic abilities of monolingual and multilingual language models (LMs). Covering 16 linguistic phenomena with 1,000 minimal pairs each, TurBLiMP fills an important gap in linguistic evaluation resources for Turkish. In designing the benchmark, we pay particular attention to two properties of Turkish that remain understudied in current syntactic evaluations of LMs: word order flexibility and subordination through morphological processes. Our experiments on a wide range of LMs and a newly collected set of human acceptability judgments reveal that even cutting-edge large LMs still struggle with grammatical phenomena that are not challenging for humans, and that they may differ from humans in their sensitivity to word order and morphological complexity.
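To make the evaluation setup concrete, the following is a minimal sketch of how a minimal-pair benchmark is typically scored. It assumes the usual protocol for such benchmarks, which the abstract does not spell out: a model is counted as correct on a pair when it assigns a higher score (for example, a log-probability) to the acceptable sentence than to the unacceptable one. The names `minimal_pair_accuracy`, `toy_score`, and `TOY_PAIRS` are hypothetical, the two Turkish pairs are illustrative and not taken from TurBLiMP, and the toy scorer is a rule-based stand-in for a real LM.

```python
from typing import Callable, Iterable, Tuple

# A minimal pair: (acceptable sentence, unacceptable sentence).
Pair = Tuple[str, str]


def minimal_pair_accuracy(pairs: Iterable[Pair],
                          score: Callable[[str], float]) -> float:
    """Fraction of pairs where the acceptable sentence scores strictly higher.

    `score` is assumed to return something like a sentence log-probability;
    ties are counted as incorrect.
    """
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one minimal pair")
    correct = sum(1 for good, bad in pairs if score(good) > score(bad))
    return correct / len(pairs)


# Illustrative subject-verb agreement pairs (NOT items from TurBLiMP).
TOY_PAIRS = [
    ("Ben geldim.", "Ben geldin."),  # 1sg subject needs the 1sg suffix -m
    ("Sen geldin.", "Sen geldim."),  # 2sg subject needs the 2sg suffix -n
]


def toy_score(sentence: str) -> float:
    """Hypothetical stand-in for an LM: penalizes agreement mismatches."""
    expected_ending = {"Ben": "m.", "Sen": "n."}
    subject, verb = sentence.split()
    return 0.0 if verb.endswith(expected_ending[subject]) else -1.0


accuracy = minimal_pair_accuracy(TOY_PAIRS, toy_score)
print(f"accuracy = {accuracy:.2f}")
```

In an actual evaluation, `toy_score` would be replaced by a function that sums a model's token log-probabilities over the sentence, and accuracy would be reported separately for each phenomenon so that it can be compared against human acceptability judgments.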