We present PanGu-Coder, a pre-trained decoder-only language model adopting the PanGu-Alpha architecture for text-to-code generation, i.e. the synthesis of programming language solutions given a natural language problem description. We train PanGu-Coder using a two-stage strategy: the first stage employs Causal Language Modelling (CLM) to pre-train on raw programming language data, while the second stage uses a combination of Causal Language Modelling and Masked Language Modelling (MLM) training objectives that focus on the downstream task of text-to-code generation, training on loosely curated pairs of natural language program definitions and code functions. Finally, we discuss PanGu-Coder-FT, which is fine-tuned on a combination of competitive programming problems and code with continuous integration tests. We evaluate PanGu-Coder with a focus on whether it generates functionally correct programs, and demonstrate that it achieves equivalent or better performance than similarly sized models, such as Codex, while attending over a smaller context window and training on less data.
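To make the first-stage objective concrete, the following is a minimal, illustrative sketch (not the authors' implementation) of a single Causal Language Modelling step on raw code, using a Hugging Face GPT-2 model as a stand-in for the PanGu-Alpha decoder-only architecture; the code sample is hypothetical. In the second stage, the same kind of next-token loss would be applied to natural language and code pairs rather than raw code, as described above.

from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Placeholder tokenizer and model standing in for PanGu-Alpha.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Raw programming-language data: one hypothetical code snippet for illustration.
sample = 'def add(a, b):\n    """Return the sum of a and b."""\n    return a + b\n'
inputs = tokenizer(sample, return_tensors="pt")

# CLM: every token is predicted from its left-hand context only.
# Passing the input ids as labels makes the model compute the shifted
# next-token cross-entropy loss internally.
outputs = model(**inputs, labels=inputs["input_ids"])
outputs.loss.backward()  # gradient for one pre-training step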