Code pre-trained models have achieved great success in various code-related tasks, such as code search, code clone detection, and code translation. However, most existing code pre-trained models treat a code snippet as a plain sequence of tokens, ignoring the inherent syntax and hierarchy that carry important structural and semantic information. As a result, the representations derived from these plain sequences are insufficient. To this end, we propose CLSEBERT, a Contrastive Learning Framework for Syntax-Enhanced Code Pre-Trained Model, to deal with various code intelligence tasks. In the pre-training stage, we exploit the code syntax and hierarchy contained in the Abstract Syntax Tree (AST) and leverage Contrastive Learning (CL) to learn noise-invariant code representations. Besides the original masked language model (MLM) objective, we introduce two novel pre-training objectives: (1) ``AST Node Edge Prediction (NEP)'' to predict edges between nodes in the abstract syntax tree; and (2) ``Code Token Type Prediction (TTP)'' to predict the types of code tokens. Extensive experiments on four code intelligence tasks demonstrate the superior performance of CLSEBERT compared to state-of-the-art models with the same pre-training corpus and parameter scale.
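As a minimal sketch (the abstract does not specify the exact contrastive loss), objectives for learning noise-invariant representations are commonly instantiated as an InfoNCE-style loss, where an anchor representation $z_i$ is pulled toward its noise-augmented positive $z_i^{+}$ and pushed away from the other $N-1$ in-batch samples $z_j$:
\begin{equation}
\mathcal{L}_{\mathrm{CL}} = -\log \frac{\exp\!\big(\mathrm{sim}(z_i, z_i^{+})/\tau\big)}{\sum_{j=1}^{N} \exp\!\big(\mathrm{sim}(z_i, z_j)/\tau\big)},
\end{equation}
where $\mathrm{sim}(\cdot,\cdot)$ denotes cosine similarity and $\tau$ is a temperature hyperparameter; the precise formulation used by CLSEBERT may differ.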