Contrastive Learning for Source Code with Structural and Functional Properties

Pre-trained transformer models have recently shown promise for understanding source code. Most existing works attempt to understand code from its textual features and limited structural knowledge. However, program functionality sometimes cannot be fully revealed by the code sequence, even with structure information. Programs can contain very different tokens and structures while sharing the same functionality, whereas changing only one or a few code tokens can introduce unexpected or malicious program behaviors while preserving the syntax and most tokens. In this work, we present BOOST, a novel self-supervised model that focuses pre-training on the characteristics of source code. We first employ automated, structure-guided code transformation algorithms that generate (i) functionally equivalent code that looks drastically different from the original, and (ii) textually and syntactically very similar code that is functionally distinct from the original. We train our model with a contrastive learning objective that brings functionally equivalent code closer together and pushes functionally distinct code further apart. To encode the structure information, we introduce a new node-type masked language model objective that helps the model learn about structural context. We pre-train BOOST with a much smaller dataset than the state-of-the-art models, but our small models can still match or outperform these large models in code understanding and generation tasks.
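
As a rough illustration of the two kinds of transformed code and of a contrastive objective, the sketch below may help. It is not the authors' implementation: the function names are hypothetical, the embeddings are hand-picked toy vectors rather than model outputs, and the InfoNCE-style loss is an assumed form, since the abstract does not state the exact objective.

```python
import math

# Original snippet: sum of the first n elements.
def sum_prefix(values, n):
    total = 0
    for i in range(n):
        total += values[i]
    return total

# Functionally equivalent but textually different
# (renamed variables, for-loop rewritten as a while-loop).
def sum_prefix_equiv(xs, count):
    acc, k = 0, 0
    while k < count:
        acc = acc + xs[k]
        k += 1
    return acc

# Textually near-identical but functionally distinct
# (a single-token change introduces an off-by-one bound).
def sum_prefix_buggy(values, n):
    total = 0
    for i in range(n + 1):
        total += values[i]
    return total

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def contrastive_loss(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss (assumed form): small when the anchor is close
    to the positive and far from every negative."""
    pos = math.exp(cosine(anchor, positive) / temperature)
    neg = sum(math.exp(cosine(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))

# Toy embeddings (assumptions, chosen by hand for illustration only).
emb_original = [1.0, 0.0]
emb_equiv = [0.9, 0.1]   # desired: close to the original
emb_buggy = [0.0, 1.0]   # desired: far from the original

good = contrastive_loss(emb_original, emb_equiv, [emb_buggy])
bad = contrastive_loss(emb_original, emb_buggy, [emb_equiv])
```

In this toy setup, `good` is smaller than `bad`, which is the direction a contrastive objective rewards: the equivalent rewrite sits near the original in embedding space, while the near-identical buggy variant is pushed away even though it shares almost all of its tokens with the original.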
