In this paper, we propose CodeRetriever, a model that learns function-level code semantic representations through large-scale code-text contrastive pre-training. We adopt two contrastive learning schemes in CodeRetriever: unimodal contrastive learning and bimodal contrastive learning. For unimodal contrastive learning, we design an unsupervised approach to build semantically related code pairs based on documentation and function names. For bimodal contrastive learning, we leverage the documentation and in-line comments of code to build code-text pairs. Both contrastive objectives can fully leverage a large-scale code corpus for pre-training. Extensive experimental results show that CodeRetriever achieves new state-of-the-art results, with significant improvements over existing code pre-trained models, on eleven domain/language-specific code search tasks covering six programming languages at different code granularities (function-level, snippet-level and statement-level). These results demonstrate the effectiveness and robustness of CodeRetriever.
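As a concrete illustration of how both schemes can be optimized, the following is a minimal sketch assuming a standard InfoNCE-style contrastive objective with in-batch negatives (the exact loss formulation is our assumption, not quoted from the paper):

$$\mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{\exp\!\big(\mathrm{sim}(h_i, h_i^{+})/\tau\big)}{\sum_{j=1}^{N} \exp\!\big(\mathrm{sim}(h_i, h_j^{+})/\tau\big)}$$

Here $h_i$ denotes the representation of an anchor code function, and $h_i^{+}$ its paired positive: a semantically related function (unimodal scheme) or its documentation / in-line comment (bimodal scheme). $\mathrm{sim}(\cdot,\cdot)$ is a similarity function such as cosine similarity, $\tau$ is a temperature hyperparameter, and the positives of the other $N-1$ examples in the batch act as negatives.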