Recent years have witnessed increasing interest in code representation learning, which aims to encode the semantics of source code into distributed vectors. Various approaches have been proposed to capture the complex semantics of source code from different views, including the plain text, the Abstract Syntax Tree (AST), and several kinds of code graphs (e.g., the Control/Data Flow Graph). However, most of them consider each view of source code independently, ignoring the correspondences among the different views. In this paper, we propose to integrate the different views, together with the natural-language description of source code, into a unified framework with Multi-View contrastive Pre-training, and name our model CODE-MVP. Specifically, we first extract multiple code views using compiler tools, and then learn the complementary information among them under a contrastive learning framework. Inspired by type checking in compilation, we also design a fine-grained type inference objective for pre-training. Experiments on three downstream tasks over five datasets demonstrate the superiority of CODE-MVP over several state-of-the-art baselines. For example, it achieves gains of 2.4/2.3/1.1 points in MRR/MAP/Accuracy on the natural language code retrieval, code similarity, and code defect detection tasks, respectively.
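To make the view-extraction step concrete, below is a minimal sketch in Python using only the standard `ast` module as a stand-in for the compiler tools the paper relies on. The three view formats (a token list, an AST node-type sequence, and variable read/write events as a crude proxy for a data-flow graph) are illustrative assumptions, not the paper's exact representations.

```python
import ast

def extract_views(source: str) -> dict:
    """Extract a plain-text view, an AST view, and a crude
    data-flow view from a Python snippet."""
    tree = ast.parse(source)
    # Plain-text view: whitespace tokenization as a stand-in for a real tokenizer.
    tokens = source.split()
    # AST view: node type names from a breadth-first traversal.
    ast_nodes = [type(node).__name__ for node in ast.walk(tree)]
    # Data-flow view: (variable, read/write) events, a rough DFG proxy.
    flow = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Name):
            kind = "write" if isinstance(node.ctx, ast.Store) else "read"
            flow.append((node.id, kind))
    return {"text": tokens, "ast": ast_nodes, "dataflow": flow}

if __name__ == "__main__":
    views = extract_views("def add(a, b):\n    c = a + b\n    return c")
    for name, view in views.items():
        print(name, view)
```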
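The cross-view contrastive objective can be illustrated with the standard InfoNCE loss, where embeddings of the same snippet under two views form a positive pair and the other snippets in the batch serve as in-batch negatives. The exact loss formulation and temperature used by CODE-MVP are not given here, so this PyTorch sketch is an assumption.

```python
import torch
import torch.nn.functional as F

def info_nce(view_a: torch.Tensor, view_b: torch.Tensor,
             temperature: float = 0.07) -> torch.Tensor:
    """view_a, view_b: (batch, dim) embeddings of the same snippets
    under two different views."""
    a = F.normalize(view_a, dim=-1)
    b = F.normalize(view_b, dim=-1)
    logits = a @ b.t() / temperature                     # (batch, batch) similarities
    targets = torch.arange(a.size(0), device=logits.device)  # positive is the diagonal
    return F.cross_entropy(logits, targets)

# Random embeddings standing in for encoder outputs of two views.
loss = info_nce(torch.randn(8, 256), torch.randn(8, 256))
print(loss.item())
```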
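The fine-grained type inference objective can likewise be sketched as per-token type classification over encoder states. The label set, the hypothetical `TypeInferenceHead`, and the masking convention (label -100 for positions that carry no variable) are illustrative assumptions rather than the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

TYPE_LABELS = ["int", "str", "list", "bool", "other"]  # hypothetical label set

class TypeInferenceHead(nn.Module):
    """Linear classifier predicting a type label for each token position."""
    def __init__(self, hidden_dim: int = 256):
        super().__init__()
        self.classifier = nn.Linear(hidden_dim, len(TYPE_LABELS))

    def forward(self, token_states: torch.Tensor,
                type_labels: torch.Tensor) -> torch.Tensor:
        # token_states: (batch, seq, hidden); type_labels: (batch, seq),
        # with -100 marking positions ignored by the loss.
        logits = self.classifier(token_states)
        return F.cross_entropy(logits.view(-1, len(TYPE_LABELS)),
                               type_labels.view(-1), ignore_index=-100)

head = TypeInferenceHead()
states = torch.randn(2, 10, 256)                     # stand-in encoder outputs
labels = torch.randint(0, len(TYPE_LABELS), (2, 10))
print(head(states, labels).item())
```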