In the process of code generation, it is essential to guarantee that the generated code satisfies the grammar constraints of the programming language (PL). However, the general failure to ensure grammatical correctness is a fatal drawback of sequence-based models, which are commonly used in code generation. In this paper, we devise a pushdown automaton (PDA)-based methodology to address this problem, exploiting the principle that a PL is a subset of the language recognizable by a PDA, so code accepted by the PDA is grammatical. Specifically, we construct a PDA module and design an algorithm that constrains the generation of sequence-based models to ensure grammatical correctness. Guided by this methodology, we further propose CodePAD, a sequence-based code generation framework equipped with the PDA module, which integrates the deduction of the PDA into deep learning. Additionally, the framework can leverage the states of the PDA deduction (including state representation, a state prediction task, and joint prediction with states) to help models understand the PDA deduction. To comprehensively evaluate CodePAD, we construct a PDA for Python and conduct extensive experiments on four public benchmark datasets. CodePAD can be combined with any existing sequence-based model, and we show that it yields relative improvements of 17\% BLEU on CONALA, 8\% EM on DJANGO, and 55\% BLEU on JUICE-10K over the base models. In addition, our method significantly enhances pre-trained models, e.g., improving the (BLEU, CodeBLEU) of CodeGen-350M from (1.55, 3.21) to (14.44, 21.54) on MBPP in the zero-shot setting.