The objective of pre-trained language models is to learn contextual representations of textual data. Pre-trained language models have become mainstream in natural language processing and code modeling. Using probes, a technique to study the linguistic properties of hidden vector spaces, previous works have shown that these pre-trained language models encode simple linguistic properties in their hidden representations. However, none of the previous work assessed whether these models encode the whole grammatical structure of a programming language. In this paper, we prove the existence of a \textit{syntactic subspace}, lying in the hidden representations of pre-trained language models, which contains the syntactic information of the programming language. We show that this subspace can be extracted from the models' representations and define a novel probing method, the AST-Probe, that enables recovering the whole abstract syntax tree (AST) of an input code snippet. In our experiments, we show that this syntactic subspace exists in five state-of-the-art pre-trained language models. In addition, we highlight that the middle layers of the models are the ones that encode most of the AST information. Finally, we estimate the optimal size of this syntactic subspace and show that its dimension is substantially lower than that of the models' representation spaces. This suggests that pre-trained language models use a small part of their representation spaces to encode the syntactic information of programming languages.
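To make the probing idea concrete, the sketch below illustrates one common form of linear probe: hidden vectors are projected by a learned low-rank matrix into a candidate syntactic subspace, and pairwise squared distances in that subspace are trained to match tree distances derived from the AST. This is a minimal illustration only; the class and parameter names (SyntacticSubspaceProbe, hidden_dim, probe_rank) are hypothetical, and the actual AST-Probe objective and AST-encoding scheme are those defined in the paper, not this code.

\begin{verbatim}
# Minimal sketch of a linear probe over a low-dimensional syntactic subspace.
# Hypothetical names; not the paper's reference implementation.
import torch
import torch.nn as nn

class SyntacticSubspaceProbe(nn.Module):
    """Project hidden vectors into a rank-k subspace and predict
    pairwise syntactic (tree) distances as squared L2 distances there."""

    def __init__(self, hidden_dim: int, probe_rank: int):
        super().__init__()
        # B maps the model's d-dimensional space onto a k-dimensional subspace.
        self.proj = nn.Linear(hidden_dim, probe_rank, bias=False)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (seq_len, hidden_dim) for one code snippet.
        transformed = self.proj(hidden_states)               # (seq_len, k)
        diffs = transformed.unsqueeze(1) - transformed.unsqueeze(0)
        return (diffs ** 2).sum(dim=-1)                      # predicted distances

def probe_loss(predicted: torch.Tensor, gold: torch.Tensor) -> torch.Tensor:
    # L1 loss between predicted distances and gold pairwise AST distances.
    return (predicted - gold).abs().mean()

# Usage: hidden states would come from a frozen pre-trained code model
# (e.g. one of the five models evaluated); here they are random stand-ins.
seq_len, hidden_dim, probe_rank = 16, 768, 128
probe = SyntacticSubspaceProbe(hidden_dim, probe_rank)
states = torch.randn(seq_len, hidden_dim)
gold = torch.randint(0, 10, (seq_len, seq_len)).float()
loss = probe_loss(probe(states), gold)
loss.backward()
\end{verbatim}

Because the model's parameters stay frozen, any structure the probe recovers must already be encoded in the hidden representations, which is what licenses the claim that the subspace exists in the model rather than being learned by the probe.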