To understand what kinds of linguistic knowledge are encoded by pretrained Chinese language models (LMs), we introduce the benchmark of Sino LINGuistics (SLING), which consists of 38K minimal sentence pairs in Mandarin Chinese grouped into 9 high-level linguistic phenomena. Each pair demonstrates the acceptability contrast of a specific syntactic or semantic phenomenon (e.g., The keys are lost vs. The keys is lost), and an LM should assign lower perplexity to the acceptable sentence. In contrast to the CLiMP dataset (Xiang et al., 2021), which also contains Chinese minimal pairs but was created by translating the vocabulary of the English BLiMP dataset, the minimal pairs in SLING are derived primarily by applying syntactic and lexical transformations to naturally-occurring, linguist-annotated sentences from the Chinese Treebank 9.0, thus addressing severe issues in CLiMP's data generation process. We test 18 publicly available pretrained monolingual (e.g., BERT-base-zh, CPM) and multilingual (e.g., mT5, XLM) language models on SLING. Our experiments show that the average accuracy of these LMs is far below human performance (69.7% vs. 97.1%), while BERT-base-zh achieves the highest accuracy (84.8%) of all tested LMs, outperforming even much larger models. Additionally, we find that most LMs have a strong gender and number (singular/plural) bias, and that they perform better on local phenomena than on hierarchical ones.
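The evaluation protocol described above reduces to a simple comparison: for each (acceptable, unacceptable) pair, check whether the LM assigns a higher log-probability (equivalently, lower perplexity) to the acceptable sentence, and report the fraction of pairs scored correctly. The sketch below illustrates this metric with a crude add-one-smoothed character-bigram scorer as a stand-in for a real pretrained LM; the function names, the toy corpus, and the example pair are all hypothetical, not taken from SLING.

```python
import math
from collections import Counter

def train_bigram_lm(corpus):
    """Train an add-one-smoothed character-bigram log-probability scorer.
    A crude stand-in for a pretrained LM's sentence-scoring function."""
    bigrams, unigrams, vocab = Counter(), Counter(), set()
    for sent in corpus:
        vocab.update(sent)
        unigrams.update(sent[:-1])          # count left contexts
        bigrams.update(zip(sent, sent[1:]))  # count adjacent character pairs
    V = len(vocab)

    def logprob(sentence):
        lp = 0.0
        for prev, cur in zip(sentence, sentence[1:]):
            # Laplace smoothing keeps unseen bigrams from zeroing out the score.
            lp += math.log((bigrams[(prev, cur)] + 1) / (unigrams[prev] + V))
        return lp

    return logprob

def minimal_pair_accuracy(pairs, score):
    """Fraction of (acceptable, unacceptable) pairs for which the scorer
    assigns higher log-probability (lower perplexity) to the acceptable one."""
    return sum(score(good) > score(bad) for good, bad in pairs) / len(pairs)

# Hypothetical toy corpus and minimal pair, for illustration only.
score = train_bigram_lm(["钥匙丢了", "他们丢了钥匙"])
pairs = [("钥匙丢了", "钥匙丢的了")]
print(minimal_pair_accuracy(pairs, score))  # → 1.0
```

In practice the scorer would be a pretrained LM: for causal models the sum of token log-probabilities, and for masked models a pseudo-log-likelihood obtained by masking one token at a time; the accuracy computation itself is unchanged.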