Pre-trained language models have demonstrated impressive performance in both natural language processing and program understanding, where they represent the input as a token sequence without explicitly modeling its structure. Prior work has shown that pre-trained language models can capture the syntactic rules of natural languages without being fine-tuned on syntax understanding tasks. However, there is so far limited understanding of how well pre-trained models capture code structure. In this work, we perform the first thorough benchmarking of state-of-the-art pre-trained models on identifying the syntactic structures of programs. Specifically, we introduce CodeSyntax, a large-scale dataset of programs annotated with the syntactic relationships in their corresponding abstract syntax trees. Our key observation is that existing language models pre-trained on code still lack an understanding of code syntax. In fact, these pre-trained programming language models fail to match the performance of simple baselines based on positional offsets and keywords. We also present a natural language benchmark to highlight the differences between natural languages and programming languages in terms of syntactic structure understanding. Our findings point out key limitations of existing pre-training methods for programming languages, and suggest the importance of modeling code syntactic structures.
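To make the notion of AST-based syntactic relationships concrete, the following minimal Python sketch (our illustration, not the CodeSyntax extraction pipeline) uses the standard `ast` module to print the parent-child edges of a small program's abstract syntax tree; these are the kinds of relations such a dataset can annotate between code tokens.

```python
import ast

# A tiny program: an if-statement guarding an assignment.
source = "if x > 0:\n    y = x + 1"
tree = ast.parse(source)

# Walk the AST and print each parent-child edge, labeled by the field
# that links them (e.g., the If node's "test" condition and "body").
for parent in ast.walk(tree):
    for field, value in ast.iter_fields(parent):
        children = value if isinstance(value, list) else [value]
        for child in children:
            if isinstance(child, ast.AST):
                print(f"{type(parent).__name__} --{field}--> {type(child).__name__}")
```

Running this prints edges such as `Module --body--> If`, `If --test--> Compare`, and `If --body--> Assign`, illustrating the syntactic structure that a token-sequence model never sees explicitly.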