Pre-trained generative language models for source code (e.g., PLBART, CodeT5, SPT-Code) have yielded strong results on several tasks in the past few years, including code generation and code translation. These models adopt varying pre-training objectives to learn the statistics of code construction from very large-scale corpora in a self-supervised fashion; the success of pre-trained models largely hinges on these pre-training objectives. This paper proposes a new pre-training objective, "Naturalizing" of source code, exploiting code's bimodal, dual-channel (formal and natural channels) nature. Unlike natural language, code's bimodal, dual-channel nature allows us to generate semantically equivalent code at scale. We introduce six classes of semantic-preserving transformations that produce "un-natural" forms of code, and then train our model to recover the more natural, original programs written by developers. Learning to generate equivalent but more natural code, at scale, over large corpora of open-source code, without explicit manual supervision, helps the model learn to both ingest and generate code. We fine-tune our model on three generative software engineering tasks (code generation, code translation, and code refinement) with limited human-curated labeled data and achieve state-of-the-art performance rivaling CodeT5. We show that our pre-trained model is especially competitive at zero-shot and few-shot learning, and better at learning code properties (e.g., syntax, data flow).
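To make the self-supervision setup concrete, the following is a minimal, purely illustrative sketch of how one hypothetical semantic-preserving transformation (rewriting a for-range loop into an equivalent while loop) could be used to build an (un-natural, natural) training pair. The class name ForRangeToWhile and this particular transformation are assumptions for illustration only; they are not taken from the paper's actual six transformation classes or its implementation.

# Illustrative sketch (not the paper's implementation): one semantic-preserving
# transformation that turns "natural" code into an equivalent but less natural
# variant, yielding a (un-natural, natural) pair for denoising-style pre-training.
import ast

class ForRangeToWhile(ast.NodeTransformer):
    """Rewrite `for i in range(n): body` into an equivalent while loop."""
    def visit_For(self, node: ast.For):
        self.generic_visit(node)
        # Only handle the simple `for <name> in range(<expr>)` pattern.
        if (isinstance(node.iter, ast.Call)
                and isinstance(node.iter.func, ast.Name)
                and node.iter.func.id == "range"
                and len(node.iter.args) == 1
                and isinstance(node.target, ast.Name)
                and not node.orelse):
            i = node.target.id
            init = ast.parse(f"{i} = 0").body[0]              # i = 0
            test = ast.Compare(left=ast.Name(id=i, ctx=ast.Load()),
                               ops=[ast.Lt()],
                               comparators=[node.iter.args[0]])  # i < <expr>
            incr = ast.parse(f"{i} += 1").body[0]              # i += 1
            loop = ast.While(test=test, body=node.body + [incr], orelse=[])
            return [init, loop]  # semantically equivalent, but less "natural"
        return node

natural = (
    "def total(xs):\n"
    "    s = 0\n"
    "    for i in range(len(xs)):\n"
    "        s += xs[i]\n"
    "    return s\n"
)
tree = ForRangeToWhile().visit(ast.parse(natural))
unnatural = ast.unparse(ast.fix_missing_locations(tree))

# (unnatural, natural) now forms one self-supervised training pair:
# the model is asked to "naturalize" the transformed code back to the original.
print(unnatural)

In this sketch, the transformed program computes the same result as the original, so no manual labeling is needed: the original developer-written code itself serves as the supervision target.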