With the rapid development of pre-training techniques, a number of language models have been pre-trained on large-scale code corpora and perform well at code generation. In this paper, we investigate how to equip pre-trained language models with the ability to generate code for private libraries. In practice, it is common for programmers to write code using private libraries. However, this is a challenge for language models, since they have never seen private APIs during training. Motivated by the fact that private libraries usually come with elaborate API documentation, we propose a novel framework with two modules: the APIRetriever finds useful APIs, and then the APICoder generates code using these APIs. For the APIRetriever, we present a dense retrieval system and also design a friendly interaction to involve users. For the APICoder, we can directly use off-the-shelf language models, or continually pre-train the base model on a code corpus containing API information. Both modules are trained with data from public libraries and can be generalized to private ones. Furthermore, we craft three benchmarks for private libraries, named TorchDataEval, MonkeyEval, and BeatNumEval. Experimental results demonstrate the impressive performance of our framework.
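To make the retrieve-then-generate pipeline concrete, the following is a minimal Python sketch, not the paper's implementation: a retriever ranks private-library API documentation against a natural-language query, and the top hits are prepended to the code-generation prompt. The hashed bag-of-words embed function is a toy stand-in for the trained dense encoder, and the API names (beatnum.arr_stack, monkey.KnowledgeFrame, etc.) are hypothetical illustrations, not the actual benchmark APIs.

```python
import numpy as np

def embed(text: str, dim: int = 256) -> np.ndarray:
    """Hashed bag-of-words vector; a toy stand-in for a learned dense encoder
    that, in the paper's setting, would be trained on public-library
    (query, API documentation) pairs."""
    vec = np.zeros(dim)
    for token in text.lower().split():
        vec[hash(token) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

# Hypothetical private-library API documentation (names are illustrative).
api_docs = {
    "beatnum.arr_stack": "Stack a sequence of arrays along a new axis.",
    "beatnum.get_mean": "Compute the arithmetic mean of array elements.",
    "monkey.KnowledgeFrame": "Two-dimensional labeled tabular data structure.",
}

def retrieve_apis(query: str, k: int = 2) -> list[str]:
    """APIRetriever step: rank API docs by dense similarity to the query."""
    q = embed(query)
    scored = [(name, float(q @ embed(f"{name} {doc}")))
              for name, doc in api_docs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [name for name, _ in scored[:k]]

def build_prompt(query: str) -> str:
    """APICoder step: prepend the retrieved API docs to the prompt that
    a code-generation language model would then complete."""
    lines = [f"# {name}: {api_docs[name]}" for name in retrieve_apis(query)]
    lines.append(f"# Task: {query}")
    return "\n".join(lines)

print(build_prompt("compute the mean of a stacked array"))
```

Under these assumptions, the sketch illustrates why the framework transfers to unseen libraries: the retriever only needs API documentation at inference time, so swapping in a private library's docs requires no retraining of either module.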