Recently, many pre-trained language models for source code have been proposed to model the context of code and serve as the basis for downstream code intelligence tasks such as code completion, code search, and code summarization. These models leverage masked pre-training and the Transformer architecture, and have achieved promising results. However, little progress has been made so far on the interpretability of existing pre-trained code models. It is not clear why these models work and what feature correlations they capture. In this paper, we conduct a thorough structural analysis aiming to provide an interpretation of pre-trained language models for source code (e.g., CodeBERT and GraphCodeBERT) from three distinct perspectives: (1) attention analysis, (2) probing of the word embeddings, and (3) syntax tree induction. Through comprehensive analysis, this paper reveals several insightful findings that may inspire future studies: (1) Attention aligns strongly with the syntax structure of code. (2) Pre-trained language models of code preserve the syntax structure of code in the intermediate representations of each Transformer layer. (3) Pre-trained models of code have the ability to induce syntax trees of code. These findings suggest that it may be helpful to incorporate the syntax structure of code into the pre-training process to obtain better code representations.
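As a minimal illustration of the first perspective (attention analysis), the sketch below shows how per-layer, per-head attention weights could be extracted from a pre-trained code model; it assumes the HuggingFace `transformers` library and the public `microsoft/codebert-base` checkpoint, and is not the paper's own implementation. The extracted weight matrices are the raw material that would then be compared against the code's syntax structure (e.g., AST edges).

```python
# A hedged sketch: extracting attention weights from CodeBERT for attention analysis.
# Assumes the HuggingFace `transformers` library; the analysis procedure itself
# (aligning weights with syntax structure) is only indicated in comments.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModel.from_pretrained("microsoft/codebert-base", output_attentions=True)
model.eval()

code = "def add(a, b):\n    return a + b"
inputs = tokenizer(code, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions is a tuple with one tensor per Transformer layer,
# each shaped (batch, num_heads, seq_len, seq_len). Each head's matrix
# gives token-to-token attention weights that can be checked for alignment
# with syntactic relations between the corresponding code tokens.
for layer_idx, layer_attn in enumerate(outputs.attentions):
    print(f"layer {layer_idx}: attention shape {tuple(layer_attn.shape)}")
```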