In the process of code generation, it is essential to guarantee that the generated code satisfies the grammar constraints of the programming language (PL). However, neglecting grammar constraints is a fatal drawback of commonly used sequence-based code generation. In this paper, we devise a pushdown automaton (PDA)-based methodology to address this problem, exploiting the principle that a PL is a subset of the language recognized by a PDA and that code accepted by the PDA is grammatical. Specifically, we construct a PDA module and design an algorithm that constrains the generation of sequence-based models to ensure grammatical correctness. Guided by this methodology, we further propose CodePAD, a sequence-based code generation framework equipped with a PDA module, which integrates PDA deduction into deep learning. Additionally, this framework can leverage the states of the PDA deduction (including state representation, a state prediction task, and joint prediction with state) to help models learn PDA deduction. To comprehensively evaluate CodePAD, we construct a PDA for Python and conduct extensive experiments on four public benchmark datasets. CodePAD can leverage existing sequence-based models, and we show that it achieves a 100\% grammatical correctness percentage on these benchmark datasets. It relatively improves CodeBLEU by 17\% on CONALA, EM by 8\% on DJANGO, and CodeBLEU by 15\% on JUICE-10K compared to base models. In addition, our method significantly enhances pre-trained models; e.g., the CodeBLEU of CodeGen-350M improves from 3.21 to 21.54 on MBPP in the zero-shot setting.
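The idea of constraining a sequence-based decoder with a PDA can be illustrated with a minimal sketch. The toy PDA below (over bracket tokens only, not the paper's actual Python PDA, and with hypothetical function names) tracks a stack of expected closing tokens; at each decoding step, only vocabulary tokens the PDA can accept are allowed, so any completed sequence is guaranteed to be well-bracketed.

```python
# Hypothetical minimal sketch of PDA-constrained decoding, not the
# paper's CodePAD implementation: a toy PDA over bracket tokens decides
# which next tokens keep a partial program grammatically completable.

OPEN = {"(": ")", "[": "]", "{": "}"}
CLOSE = set(OPEN.values())


def pda_step(stack, token):
    """Return the new stack if `token` is admissible, else None."""
    if token in OPEN:
        return stack + [OPEN[token]]  # push the expected closer
    if token in CLOSE:
        if stack and stack[-1] == token:
            return stack[:-1]  # pop the matched closer
        return None  # closing token with no matching opener: reject
    return stack  # ordinary tokens leave the stack unchanged


def allowed_next(prefix, vocab):
    """Mask the vocabulary: keep only tokens the PDA can accept next."""
    stack = []
    for tok in prefix:
        stack = pda_step(stack, tok)
        assert stack is not None, "prefix itself is ungrammatical"
    return [tok for tok in vocab if pda_step(stack, tok) is not None]


vocab = ["(", ")", "[", "]", "x", "+"]
# After "(" the stack expects ")", so "]" is masked out.
print(allowed_next(["(", "x"], vocab))  # → ['(', ')', '[', 'x', '+']
```

In a real framework, this mask would be applied to the model's logits before sampling each token, so the neural decoder can only ever emit sequences the PDA accepts.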