Code search aims to retrieve semantically relevant code snippets for a given natural language query. Recently, many approaches employing contrastive learning have shown promising results on code representation learning and greatly improved the performance of code search. However, contrastive learning for code search still leaves considerable room for improvement. In this paper, we propose CoCoSoDa, which uses contrastive learning more effectively for code search by addressing two of its key factors: data augmentation and negative samples. Specifically, soft data augmentation dynamically masks some tokens in the input sequences, or replaces them with their types, to generate positive samples. A momentum mechanism, which maintains a queue and a momentum encoder, generates large and consistent representations of negative samples for each mini-batch. In addition, multimodal contrastive learning pulls together the representations of paired code snippets and queries and pushes apart those of unpaired ones. We conduct extensive experiments to evaluate the effectiveness of our approach on a large-scale dataset with six programming languages. Experimental results show that: (1) CoCoSoDa outperforms 14 baselines and, in particular, exceeds CodeBERT, GraphCodeBERT, and UniXcoder by 13.3%, 10.5%, and 5.9% on average MRR scores, respectively. (2) The ablation studies show the effectiveness of each component of our approach. (3) When we adapt our techniques to several pre-trained models, such as RoBERTa, CodeBERT, and GraphCodeBERT, we observe a significant boost in their code search performance. (4) Our model performs robustly under different hyper-parameters. Furthermore, we perform qualitative and quantitative analyses to explore the reasons behind the good performance of our model.
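The three ingredients named in the abstract (soft data augmentation, a momentum encoder with a queue of negatives, and a contrastive loss) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `<mask>` token, the augmentation ratio, the momentum coefficient, and the temperature are all assumptions chosen for illustration, and the loss shown is a generic InfoNCE loss for a single query-to-code direction over plain Python lists.

```python
import math
import random
from collections import deque

MASK = "<mask>"  # hypothetical mask token, not taken from the paper


def soft_augment(tokens, types, ratio=0.15, mode="mask", rng=random):
    """Return a copy of `tokens` in which a random subset of positions is
    either masked (mode="mask") or replaced by the token's type (mode="type").
    The positions are re-sampled on every call, so each call gives a new view."""
    assert len(tokens) == len(types)
    out = list(tokens)
    n = max(1, round(len(tokens) * ratio)) if tokens else 0
    for i in rng.sample(range(len(tokens)), n):
        out[i] = MASK if mode == "mask" else types[i]
    return out


def momentum_update(key_params, query_params, m=0.999):
    """Move the momentum (key) encoder slowly toward the trained encoder:
    theta_k <- m * theta_k + (1 - m) * theta_q. Parameters are flat lists here."""
    return [m * k + (1.0 - m) * q for k, q in zip(key_params, query_params)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def info_nce(query, positive, negatives, temperature=0.07):
    """InfoNCE loss for one query vector: the positive is placed at index 0,
    followed by the negatives, and the loss is -log softmax(logits)[0]."""
    logits = [dot(query, positive) / temperature]
    logits += [dot(query, neg) / temperature for neg in negatives]
    top = max(logits)  # subtract the max for numerical stability
    log_z = top + math.log(sum(math.exp(l - top) for l in logits))
    return log_z - logits[0]


if __name__ == "__main__":
    tokens = ["def", "add", "(", "a", ",", "b", ")", ":"]
    types = ["keyword", "identifier", "(", "identifier", ",", "identifier", ")", ":"]
    print(soft_augment(tokens, types, ratio=0.25, mode="type"))

    # A bounded queue keeps only the most recent negative representations.
    queue = deque(maxlen=4)
    for vec in ([0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]):
        queue.append(vec)
    print(info_nce([1.0, 0.0], [1.0, 0.0], list(queue)))
```

In this sketch the queue is a `deque` with a fixed `maxlen`, so the oldest negatives are dropped as new ones arrive, which is one simple way to keep the negative set larger than a single mini-batch. A symmetric, code-to-query term would be computed the same way with the roles of the two modalities swapped.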