We introduce The Benchmark of Linguistic Minimal Pairs (BLiMP), a challenge set for evaluating what language models (LMs) know about major grammatical phenomena in English. BLiMP consists of 67 sub-datasets, each containing 1000 minimal pairs that isolate specific contrasts in syntax, morphology, or semantics. The data is automatically generated according to expert-crafted grammars, and aggregate human agreement with the labels is 96.4%. We use it to evaluate n-gram, LSTM, and Transformer (GPT-2 and Transformer-XL) LMs. We find that state-of-the-art models identify morphological contrasts reliably, but struggle with semantic restrictions on the distribution of quantifiers and negative polarity items, and with subtle syntactic phenomena such as extraction islands.
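To make the evaluation setup concrete, the following is a minimal sketch of how accuracy on a set of minimal pairs could be computed. It assumes the usual forced-choice criterion, which the abstract does not spell out: a model is counted as correct on a pair when it assigns a higher probability to the acceptable sentence than to the unacceptable one. The helper names (`minimal_pair_accuracy`, `make_bigram_scorer`), the add-one-smoothed bigram scorer, and the example sentences are all hypothetical illustrations; they are not BLiMP data and not the models evaluated in the paper.

```python
import math
from collections import Counter
from typing import Callable, Iterable, List, Tuple

# A minimal pair: (acceptable sentence, unacceptable sentence).
Pair = Tuple[str, str]


def minimal_pair_accuracy(pairs: Iterable[Pair],
                          log_prob: Callable[[str], float]) -> float:
    """Fraction of pairs where the acceptable sentence scores strictly higher.

    Assumption: forced-choice scoring by whole-sentence log-probability.
    """
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one minimal pair")
    correct = sum(1 for good, bad in pairs if log_prob(good) > log_prob(bad))
    return correct / len(pairs)


def make_bigram_scorer(corpus: List[str]) -> Callable[[str], float]:
    """Build a toy add-one-smoothed bigram LM (a stand-in for a real LM)."""
    context_counts: Counter = Counter()
    bigram_counts: Counter = Counter()
    vocab = set()
    for sentence in corpus:
        tokens = ["<s>"] + sentence.lower().split()
        vocab.update(tokens)
        for prev, cur in zip(tokens, tokens[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, cur)] += 1
    vocab_size = len(vocab) + 1  # one extra slot for unseen tokens

    def log_prob(sentence: str) -> float:
        tokens = ["<s>"] + sentence.lower().split()
        return sum(
            math.log((bigram_counts[(prev, cur)] + 1)
                     / (context_counts[prev] + vocab_size))
            for prev, cur in zip(tokens, tokens[1:])
        )

    return log_prob


# Illustrative training sentences and subject-verb agreement pairs.
toy_corpus = ["the cats sleep", "the cat sleeps",
              "the dogs bark", "the dog barks"]
toy_pairs = [("the cats sleep", "the cats sleeps"),
             ("the dog barks", "the dog bark")]

scorer = make_bigram_scorer(toy_corpus)
accuracy = minimal_pair_accuracy(toy_pairs, scorer)
print(f"accuracy on toy pairs: {accuracy:.2f}")
```

Because the two sentences in each pair differ minimally and here have the same number of tokens, comparing raw sentence log-probabilities does not favor the shorter sentence. To score a neural LM instead, `log_prob` would be replaced by a function that sums that model's token log-probabilities.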