CodeRL: Mastering Code Generation through Pretrained Models and Deep Reinforcement Learning

Program synthesis, or code generation, aims to generate a program that satisfies a problem specification. Recent approaches using large-scale pretrained language models (LMs) have shown promising results, yet they have some critical limitations. In particular, they often follow a standard supervised fine-tuning procedure, training a code generation model only on pairs of natural-language problem descriptions and ground-truth programs. Such a paradigm largely ignores important and potentially useful signals in the problem specification, such as unit tests, and this often results in poor performance on complex, unseen coding tasks. To address these limitations, we propose "CodeRL", a new framework for program synthesis tasks through pretrained LMs and deep reinforcement learning (RL). Specifically, during training, we treat the code-generating LM as an actor network and introduce a critic network that is trained to predict the functional correctness of generated programs and to provide dense feedback signals to the actor. During inference, we introduce a new generation procedure with a critical sampling strategy that allows a model to automatically regenerate programs based on feedback from example unit tests and critic scores. For the model backbones, we extend the encoder-decoder architecture of CodeT5 with enhanced learning objectives, larger model sizes, and better pretraining data. Our method not only achieves new SOTA results on the challenging APPS benchmark, but also shows strong zero-shot transfer capability, with new SOTA results on the simpler MBPP benchmark.
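To make the role of unit tests as a feedback signal more concrete, the following is a minimal sketch, not the implementation described in the abstract. It scores a candidate program against example unit tests and resamples until one passes. The reward constants, the `solve` entry-point convention, and the functions `unit_test_reward` and `generate_with_feedback` are all hypothetical names and values chosen for illustration. The `sample` callable stands in for the actor LM, and the critic network is not modeled here.

```python
from typing import Callable, List, Tuple

# Hypothetical reward values, chosen only for illustration.
REWARD_COMPILE_ERROR = -1.0
REWARD_RUNTIME_ERROR = -0.6
REWARD_FAILED_TEST = -0.3
REWARD_PASSED = 1.0

# Each test is (positional arguments, expected return value).
UnitTest = Tuple[tuple, object]


def unit_test_reward(source: str, tests: List[UnitTest],
                     entry_point: str = "solve") -> float:
    """Map the outcome of running `source` on `tests` to a scalar reward."""
    try:
        code = compile(source, "<candidate>", "exec")
    except SyntaxError:
        return REWARD_COMPILE_ERROR
    namespace: dict = {}
    try:
        # Assumption: candidates are trusted toy programs. A real system
        # would execute generated code in a sandbox with a timeout.
        exec(code, namespace)
        func = namespace[entry_point]
        results = [func(*args) == expected for args, expected in tests]
    except Exception:
        return REWARD_RUNTIME_ERROR
    return REWARD_PASSED if all(results) else REWARD_FAILED_TEST


def generate_with_feedback(sample: Callable[[int], str],
                           tests: List[UnitTest],
                           max_rounds: int = 3) -> Tuple[str, float]:
    """Resample candidates until one passes the example tests or rounds run out."""
    best_source, best_reward = "", float("-inf")
    for round_idx in range(max_rounds):
        source = sample(round_idx)  # stand-in for sampling from the actor LM
        reward = unit_test_reward(source, tests)
        if reward > best_reward:
            best_source, best_reward = source, reward
        if reward == REWARD_PASSED:
            break
    return best_source, best_reward


# Toy "model" that returns a fixed sequence of candidates.
candidates = [
    "def solve(a, b) return a + b",         # does not compile
    "def solve(a, b):\n    return a - b",   # runs, but fails the tests
    "def solve(a, b):\n    return a + b",   # passes the tests
]
tests = [((1, 2), 3), ((0, 0), 0)]
best, reward = generate_with_feedback(lambda i: candidates[i], tests)
```

In this toy run, the loop stops at the third candidate because it is the first to pass both example tests. During training, the same scalar could serve as the return for a policy-gradient update of the actor. The dense, per-token signal that the abstract attributes to the critic would require a learned model and is outside the scope of this sketch.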
