Code contrastive pre-training has recently achieved significant progress on code-related tasks. In this paper, we present \textbf{SCodeR}, a \textbf{S}oft-labeled contrastive pre-training framework with two positive sample construction methods to learn functional-level \textbf{Code} \textbf{R}epresentation. Considering the relevance among code snippets in a large-scale code corpus, the soft-labeled contrastive pre-training obtains fine-grained soft labels through an iterative adversarial manner and uses them to learn better code representations. Positive sample construction is another key component of contrastive pre-training. Previous works use transformation-based methods, such as variable renaming, to generate semantically equivalent positive samples. However, the generated code usually has a highly similar surface form, which misleads the model to focus on superficial code structure rather than code semantics. To encourage SCodeR to capture semantic information from code, we utilize code comments and abstract syntax sub-trees to build positive samples. We conduct experiments on four code-related tasks over seven datasets. Extensive experimental results show that SCodeR achieves new state-of-the-art performance on all of them, which demonstrates the effectiveness of the proposed pre-training method.
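To make the notion of soft-labeled contrastive learning concrete, the following is a minimal sketch of an in-batch contrastive loss that uses a soft-label distribution over candidates instead of one-hot targets. It is only an illustration of the general idea, not the paper's exact formulation: the function name, the \texttt{soft\_labels} input (standing in for the output of the iterative adversarial labeling step), and the temperature value are assumptions introduced here for exposition.

\begin{verbatim}
import torch
import torch.nn.functional as F

def soft_labeled_contrastive_loss(query_emb, key_emb, soft_labels,
                                  temperature=0.05):
    """In-batch contrastive loss with soft targets.

    query_emb, key_emb: (B, D) embeddings of paired samples.
    soft_labels: (B, B) row-stochastic matrix; entry (i, j) is the
        soft relevance of key j to query i (hypothetically produced
        by an iterative adversarial labeling procedure).
    """
    q = F.normalize(query_emb, dim=-1)
    k = F.normalize(key_emb, dim=-1)
    logits = q @ k.t() / temperature           # (B, B) similarities
    log_probs = F.log_softmax(logits, dim=-1)  # predicted distribution
    # Cross-entropy against the soft-label distribution;
    # a one-hot soft_labels matrix recovers the standard InfoNCE loss.
    return -(soft_labels * log_probs).sum(dim=-1).mean()
\end{verbatim}

Under this view, the fine-grained soft labels let partially related in-batch candidates contribute a graded (rather than strictly negative) training signal.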