Many recent studies in software engineering have introduced deep neural models based on the Transformer architecture or used Transformer-based Pre-trained Language Models (PLMs) trained on code. Although these models achieve state-of-the-art results in many downstream tasks, such as code summarization and bug detection, they build on Transformers and PLMs, which have mainly been studied in the Natural Language Processing (NLP) field. Current studies rely on reasoning and practices from NLP when applying these models to code, despite the differences between natural languages and programming languages, and the literature explaining how code is modeled remains limited. Here, we investigate the attention behavior of PLMs on code and compare it with natural language. We pre-trained BERT, a Transformer-based PLM, on code and explored what kind of information, both semantic and syntactic, it learns. We ran several experiments to analyze how much attention code constructs pay to each other and what BERT learns in each layer. Our analyses show that BERT pays more attention to syntactic entities, specifically identifiers and separators, in contrast to the [CLS] token, which is the most attended token in NLP. This observation motivated us to leverage identifiers, rather than the [CLS] token, to represent the code sequence for code clone detection. Our results show that employing embeddings from identifiers increases the performance of BERT by 605% and 4% in F1-score in its lower and upper layers, respectively. When identifiers' embeddings are used in CodeBERT, a code-based PLM, the F1-score of clone detection improves by 21-24%. These findings can benefit the research community by encouraging code-specific representations instead of the common embeddings used in NLP, and they open new directions for developing smaller models with similar performance.
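The central idea, representing a code sequence by its identifier token embeddings rather than the [CLS] embedding, can be illustrated with a minimal sketch. The sketch below assumes the HuggingFace transformers library; the model checkpoint, the example snippet, and the identifier list (which would normally come from a code parser) are illustrative placeholders and not the paper's actual pipeline.

import torch
from transformers import AutoModel, AutoTokenizer

# Illustrative checkpoint; the study pre-trains BERT on code and also evaluates CodeBERT.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

code = "int maxValue = Math.max(left, right);"   # hypothetical input snippet
identifiers = ["maxValue", "left", "right"]      # assumed to be extracted by a code parser

# Tokenize with character offsets so sub-tokens can be mapped back to identifiers.
enc = tokenizer(code, return_tensors="pt", return_offsets_mapping=True)
offsets = enc.pop("offset_mapping")[0].tolist()

with torch.no_grad():
    hidden = model(**enc).last_hidden_state[0]   # (seq_len, hidden_size)

# Select sub-token positions whose character span falls inside an identifier.
spans = [(code.index(name), code.index(name) + len(name)) for name in identifiers]
ident_positions = [
    i for i, (start, end) in enumerate(offsets)
    if end > start and any(start >= s and end <= e for s, e in spans)
]

cls_embedding = hidden[0]                               # conventional [CLS] representation
identifier_embedding = hidden[ident_positions].mean(0)  # identifier-based representation

Either vector can then be fed to a downstream classifier (e.g., for clone detection); the abstract's results correspond to swapping the first representation for the second.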