Recently, pre-trained programming language models such as CodeBERT have demonstrated substantial gains in code search. Despite their strong performance, they rely on the availability of large amounts of parallel data to fine-tune the semantic mappings between queries and code. This restricts their practicality in domain-specific languages, where data is relatively scarce and expensive to obtain. In this paper, we propose CDCS, a novel approach for domain-specific code search. CDCS employs a transfer-learning framework in which a program representation model is first pre-trained on a large corpus of common programming languages (such as Java and Python) and then adapted to domain-specific languages such as SQL and Solidity. Unlike cross-language CodeBERT, which is directly fine-tuned on the target language, CDCS adapts a few-shot meta-learning algorithm, MAML, to learn a good initialization of model parameters that can be effectively reused in a domain-specific language. We evaluate the proposed approach on two domain-specific languages, namely, SQL and Solidity, with models transferred from two widely used languages (Python and Java). Experimental results show that CDCS significantly outperforms conventional pre-trained code models that are directly fine-tuned on domain-specific languages, and that it is particularly effective when data is scarce.
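To make the meta-learning step concrete, below is a minimal first-order MAML sketch in PyTorch (first-order MAML is a common simplification of the full second-order algorithm). The names `model`, `tasks`, and `compute_loss` are illustrative placeholders, and the single inner step and first-order update are assumptions made for brevity; this is a sketch of the technique, not the paper's actual implementation.

```python
import copy
import torch

def fomaml_epoch(model, tasks, compute_loss, inner_lr=1e-5, meta_lr=1e-5):
    """One epoch of first-order MAML over an iterable of
    (support_batch, query_batch) task pairs."""
    meta_opt = torch.optim.Adam(model.parameters(), lr=meta_lr)
    for support_batch, query_batch in tasks:
        # Inner loop: adapt a clone of the current parameters on the
        # task's support set (a single gradient step here, for brevity).
        learner = copy.deepcopy(model)
        inner_opt = torch.optim.SGD(learner.parameters(), lr=inner_lr)
        learner.zero_grad()
        compute_loss(learner, support_batch).backward()
        inner_opt.step()

        # Outer loop: evaluate the adapted parameters on the query set;
        # the first-order approximation applies the resulting gradients
        # directly back to the original (meta) parameters, which converge
        # toward an initialization that adapts well to new tasks.
        learner.zero_grad()
        compute_loss(learner, query_batch).backward()
        for p, lp in zip(model.parameters(), learner.parameters()):
            p.grad = None if lp.grad is None else lp.grad.clone()
        meta_opt.step()
```

After meta-training on tasks drawn from the source languages, the resulting parameters serve as the initialization that is fine-tuned on the small domain-specific corpus.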