Pre-trained language models for programming languages have shown a powerful ability to handle many Software Engineering (SE) tasks, e.g., program synthesis, code completion, and code search. However, it remains to be seen what is behind their success. Recent studies have examined how pre-trained models can effectively learn syntax information based on Abstract Syntax Trees (ASTs). In this paper, we investigate what role the self-attention mechanism plays in understanding code syntax and semantics based on ASTs and static analysis. We focus on a well-known representative code model, CodeBERT, and study how it learns code syntax and semantics through the self-attention mechanism and Masked Language Modelling (MLM) at the token level. We propose a group of probing tasks to analyze CodeBERT. Based on ASTs and static analysis, we establish relationships among code tokens. First, our results show that CodeBERT can acquire syntactic and semantic knowledge through self-attention and MLM. Second, we demonstrate that the self-attention mechanism pays more attention to dependence-relationship tokens than to other tokens. Different attention heads play different roles in learning code semantics, and some of them are weak at encoding it. Different layers have different abilities to represent different code properties; deeper CodeBERT layers can encode semantic information that requires complex inference over the code context. More importantly, we show that our analysis is helpful and leverage our conclusions to improve CodeBERT. We present an alternative approach for pre-training models that makes full use of the current pre-training strategy, i.e., MLM, to learn code syntax and semantics, instead of combining features from different code data formats, e.g., data flow, run-time states, and program outputs.
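As a minimal illustrative sketch (not part of the paper's artifacts), the kind of attention probing described above can be approximated with the HuggingFace transformers library and the microsoft/codebert-base checkpoint: extract per-layer, per-head attention weights and compare the attention assigned to a pair of tokens related by a dependence (e.g., a variable definition and its use) against an unrelated pair. The token positions below are hypothetical and would normally be derived from the AST and static analysis.

```python
import torch
from transformers import RobertaTokenizer, RobertaModel

# Load CodeBERT with attention outputs enabled.
tokenizer = RobertaTokenizer.from_pretrained("microsoft/codebert-base")
model = RobertaModel.from_pretrained("microsoft/codebert-base", output_attentions=True)
model.eval()

code = "def add(a, b):\n    return a + b"
inputs = tokenizer(code, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions: tuple of 12 layers, each of shape (batch, heads, seq_len, seq_len).
attentions = torch.stack(outputs.attentions).squeeze(1)  # (layers, heads, seq, seq)

# Inspect sub-tokens to locate positions of interest (positions below are illustrative only).
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
print(tokens)

i, j = 4, 12  # hypothetical positions of a def-use (dependence-related) token pair
k, l = 4, 2   # hypothetical positions of an unrelated token pair
dep_attn = attentions[:, :, i, j].mean().item()
other_attn = attentions[:, :, k, l].mean().item()
print(f"mean attention, dependence pair: {dep_attn:.4f}")
print(f"mean attention, unrelated pair:  {other_attn:.4f}")
```

In a full probing setup, such pairwise attention scores would be aggregated over many programs and broken down per layer and per head, rather than read off a single example as done here.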