In the process of code generation, it is essential to guarantee that the generated code satisfies the grammar constraints of the programming language (PL). However, the general failure to ensure grammatical correctness is a fatal drawback of sequence-based models, which are commonly used in code generation. In this paper, we devise a pushdown automaton (PDA)-based methodology to address this problem, exploiting the principle that a PL is a subset of the language recognizable by a PDA, so code accepted by the PDA is grammatical. Specifically, we construct a PDA module and design an algorithm that constrains the generation of sequence-based models to ensure grammatical correctness. Guided by this methodology, we further propose CodePAD, a sequence-based code generation framework equipped with the PDA module, which integrates the deduction of the PDA into deep learning. Additionally, the framework can leverage the states of the PDA deduction (including state representation, a state prediction task, and joint prediction with states) to help models understand the PDA deduction. To comprehensively evaluate CodePAD, we construct a PDA for Python and conduct extensive experiments on four public benchmark datasets. CodePAD can be combined with any existing sequence-based model, and we show that it yields relative improvements of 17\% BLEU on CONALA, 8\% EM on DJANGO, and 55\% BLEU on JUICE-10K over the base models. In addition, our method significantly enhances pre-trained models, e.g., improving the (BLEU, CodeBLEU) of CodeGen-350M from (1.55, 3.21) to (14.44, 21.54) on MBPP in the zero-shot setting.