We propose Corder, a self-supervised contrastive learning framework for source code models. Corder is designed to alleviate the need for labeled data in code retrieval and code summarization tasks. The pre-trained Corder model can be used in two ways: (1) it can produce vector representations of code applicable to code retrieval tasks that have no labeled data; (2) it can serve as the starting point for fine-tuning on tasks that still require labeled data, such as code summarization. The key innovation is that we train the source code model to recognize similar and dissimilar code snippets through a contrastive learning objective. To do so, we use a set of semantic-preserving transformation operators to generate code snippets that are syntactically diverse but semantically equivalent. Through extensive experiments, we show that code models pre-trained with Corder substantially outperform the baselines on code-to-code retrieval, text-to-code retrieval, and code-to-text summarization tasks.
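To make the contrastive objective concrete, the sketch below shows an NT-Xent-style loss over two batches of embeddings, where each positive pair consists of two semantically equivalent variants of the same snippet produced by a semantic-preserving transformation. This is a minimal illustration under assumed names (`encoder`, `transform` are hypothetical), not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.5) -> torch.Tensor:
    """Contrastive (NT-Xent-style) loss over two batches of embeddings.

    z1[i] and z2[i] are embeddings of two semantically equivalent variants
    of the same code snippet (a positive pair); every other snippet in the
    batch serves as a negative.
    """
    z1 = F.normalize(z1, dim=1)
    z2 = F.normalize(z2, dim=1)
    z = torch.cat([z1, z2], dim=0)             # (2N, d)
    sim = z @ z.t() / temperature              # pairwise cosine similarities
    sim.fill_diagonal_(float("-inf"))          # a snippet is never its own negative
    n = z1.size(0)
    # The positive for row i is row i + n, and vice versa.
    targets = torch.cat([torch.arange(n, device=z.device) + n,
                         torch.arange(n, device=z.device)])
    return F.cross_entropy(sim, targets)

# Hypothetical usage: `encoder` maps code snippets to vectors and `transform`
# applies a random semantic-preserving operator (e.g. variable renaming,
# statement permutation, dead-code insertion).
# z1 = encoder(batch_of_snippets)
# z2 = encoder(transform(batch_of_snippets))
# loss = nt_xent_loss(z1, z2)
```

After pre-training with such an objective, the encoder's output vectors can be used directly for retrieval, or the encoder can be fine-tuned on downstream tasks such as code summarization.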