Pre-trained language models have been shown to encode linguistic structures, e.g. dependency and constituency parse trees, in their embeddings, despite being trained on unsupervised objectives such as masked language modeling. Doubts have been raised about whether these models actually perform parsing or merely some computation weakly correlated with it. We study the following questions: (a) Is it possible to explicitly describe transformers with realistic embedding dimension, number of heads, etc. that are capable of parsing -- or even approximate parsing? (b) Why do pre-trained models capture parsing structure? This paper takes a step toward answering these questions in the context of generative modeling with PCFGs. We show that masked language models like BERT or RoBERTa of moderate size can approximately execute the Inside-Outside algorithm for the English PCFG [Marcus et al., 1993]. We also show that the Inside-Outside algorithm is optimal for masked language modeling loss on PCFG-generated data. In addition, we give a construction of transformers with $50$ layers, $15$ attention heads, and $1275$-dimensional embeddings on average such that constituency parsing with $>70\%$ F1 score on the PTB dataset is possible using their embeddings. Finally, we conduct probing experiments on models pre-trained on PCFG-generated data and show that the embeddings not only allow recovery of approximate parse trees, but also recover the marginal span probabilities computed by the Inside-Outside algorithm, which suggests an implicit bias of masked language modeling toward this algorithm.
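For readers unfamiliar with the Inside-Outside algorithm referenced above, the inside pass alone illustrates the dynamic program: for a PCFG in Chomsky normal form, it computes, for every span, the probability that each nonterminal derives that span. The sketch below uses a hypothetical toy grammar (the rule set and probabilities are illustrative, not from the paper) and computes the inside chart whose top entry is the total sentence probability; the marginal span probabilities discussed in the abstract additionally require the symmetric outside pass.

```python
from collections import defaultdict

# Toy PCFG in Chomsky normal form (hypothetical grammar, for illustration only).
binary_rules = {          # (B, C) -> list of (A, p) meaning rule A -> B C with prob p
    ("NP", "VP"): [("S", 1.0)],
    ("Det", "N"): [("NP", 1.0)],
    ("V", "NP"): [("VP", 1.0)],
}
lexical_rules = {         # w -> list of (A, p) meaning rule A -> w with prob p
    "the": [("Det", 1.0)],
    "dog": [("N", 0.5)],
    "cat": [("N", 0.5)],
    "saw": [("V", 1.0)],
}

def inside(sentence):
    """CKY-style inside pass: alpha[i][j][A] = P(A derives words i..j-1)."""
    n = len(sentence)
    alpha = [[defaultdict(float) for _ in range(n + 1)] for _ in range(n + 1)]
    # Width-1 spans come from lexical rules.
    for i, w in enumerate(sentence):
        for A, p in lexical_rules.get(w, []):
            alpha[i][i + 1][A] += p
    # Wider spans combine two adjacent sub-spans via binary rules.
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):
                for B, pb in alpha[i][k].items():
                    for C, pc in alpha[k][j].items():
                        for A, p in binary_rules.get((B, C), []):
                            alpha[i][j][A] += p * pb * pc
    return alpha

alpha = inside(["the", "dog", "saw", "the", "cat"])
print(alpha[0][5]["S"])  # total probability of the sentence under the toy grammar
```

The chart fills in O(n^3) time over all split points, which is the computation the paper argues moderate-size transformers can approximate.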