General-purpose code generation aims to automatically convert a natural language (NL) description into a code snippet in a general-purpose programming language (GPL) such as Python. Intrinsically, code generation is a special type of text generation whose output, code, must be well-formed. However, existing sequence-to-sequence (Seq2Seq) approaches generate GPL code without regard to its grammar rules. To this end, we make the first attempt to consider grammatical Seq2Seq models for general-purpose code generation and propose CODEP, a grammatical Seq2Seq code generation framework equipped with a pushdown automaton (PDA) module. In the training stage, CODEP additionally incorporates a state representation and a state prediction task, which leverage PDA states to help CODEP comprehend the parsing process of the PDA module. In the inference stage, CODEP generates well-formed code using the PDA module and the joint prediction of PDA states. Furthermore, the PDA module can be applied directly to Seq2Seq models without training to ensure the grammatical correctness of the generated code. To evaluate the effectiveness of the proposed method, we construct the PDA for Python, the most popular GPL, and conduct extensive experiments on four benchmark datasets. The experimental results demonstrate the superiority of CODEP over state-of-the-art approaches without pre-training, and the PDA module also achieves significant improvements on pre-trained models.
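The idea of using a PDA to keep decoding grammatical can be illustrated with a minimal sketch. It is not the CODEP implementation: it assumes a toy balanced-bracket language in place of Python's grammar, uses greedy decoding over hand-written scores in place of a Seq2Seq model, and omits the state prediction task. All names (`allowed_tokens`, `push_or_pop`, `constrained_decode`) are hypothetical.

```python
from typing import Dict, List, Set, Tuple

# Toy grammar (an assumption for illustration): atoms "x" and balanced
# round/square brackets. A real system would use a PDA built for Python.
PAIRS: Dict[str, str] = {"(": ")", "[": "]"}
EOS = "<eos>"


def allowed_tokens(stack: List[str]) -> Set[str]:
    """Return the tokens the PDA accepts given the current stack."""
    allowed = {"x", *PAIRS}            # atoms and openers are always valid
    if stack:
        allowed.add(PAIRS[stack[-1]])  # only the matching closer is valid
    else:
        allowed.add(EOS)               # may stop only when nothing is open
    return allowed


def push_or_pop(stack: List[str], token: str) -> List[str]:
    """Apply the PDA transition for an already-validated token."""
    if token in PAIRS:
        return stack + [token]         # push an opener
    if token in PAIRS.values():
        return stack[:-1]              # pop on the matching closer
    return stack                       # atoms leave the stack unchanged


def constrained_decode(
    step_scores: List[Dict[str, float]],
) -> Tuple[List[str], List[str]]:
    """Greedy decoding where each step picks the best PDA-valid token.

    Returns the generated tokens and the final stack; a non-empty stack
    means the scores ran out before the sequence could be closed.
    """
    stack: List[str] = []
    output: List[str] = []
    for scores in step_scores:
        # sorted() keeps the choice deterministic across runs.
        candidates = sorted(allowed_tokens(stack))
        token = max(candidates, key=lambda t: scores.get(t, float("-inf")))
        if token == EOS:
            break
        output.append(token)
        stack = push_or_pop(stack, token)
    return output, stack


# Hand-written scores standing in for model outputs. At step 3 the
# highest-scoring token "]" is ungrammatical, so the PDA mask rejects it.
demo_scores = [
    {"(": 0.90, ")": 0.05, "[": 0.01, "]": 0.01, "x": 0.02, EOS: 0.01},
    {"(": 0.10, ")": 0.10, "[": 0.05, "]": 0.05, "x": 0.60, EOS: 0.10},
    {"(": 0.02, ")": 0.20, "[": 0.02, "]": 0.70, "x": 0.05, EOS: 0.01},
    {"(": 0.03, ")": 0.05, "[": 0.03, "]": 0.04, "x": 0.05, EOS: 0.80},
]
tokens, final_stack = constrained_decode(demo_scores)
print(tokens)  # ['(', 'x', ')']
```

Because the mask is applied only at decoding time, a filter of this kind can wrap an already-trained model, which mirrors the abstract's claim that the PDA module can be applied to Seq2Seq models without training.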