Pre-trained language models have driven rapid progress in automated code generation in recent years. However, generated code often fails to satisfy the syntactic constraints of the target language, especially for Turducken-style code, in which declarative code snippets are embedded within imperative programs. In this study, we distill the problem of missing syntactic constraints into three challenges: (1) efficiently representing syntactic constraints, (2) effectively integrating syntactic information, and (3) designing a scalable syntax-first decoding algorithm. To address these challenges, we propose TurduckenGen, a syntax-guided multi-task learning approach. Specifically, we first explicitly append type information to the code tokens to represent the syntactic constraints. We then formalize code generation with this syntactic constraint representation as an auxiliary task, enabling the model to learn the syntactic constraints of the code. Finally, compiler feedback is used to select syntactically correct code from multiple candidates. Extensive experiments and comprehensive analysis, comparing our approach with six state-of-the-art baselines on two Turducken-style code datasets, demonstrate its effectiveness and general applicability. We also conducted a human study and found that the code generated by our approach is of higher quality than that of the baselines in terms of readability and semantic similarity.
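To make the first and third steps concrete, the following is a minimal illustrative sketch, not the TurduckenGen implementation. It assumes Python as a stand-in target language, uses the standard `tokenize` module to attach a coarse token type to each code token, and uses `ast.parse` as a stand-in for compiler feedback. The function names `annotate_with_types` and `select_first_valid`, the `token<TYPE>` format, and the fallback to the top-ranked candidate are all hypothetical choices made for illustration.

```python
import ast
import io
import tokenize
from typing import List, Optional

# Token kinds that carry layout only and are skipped in this sketch.
_SKIPPED = {
    tokenize.NEWLINE, tokenize.NL, tokenize.INDENT,
    tokenize.DEDENT, tokenize.ENDMARKER, tokenize.COMMENT,
}


def annotate_with_types(code: str) -> List[str]:
    """Append a coarse token type to every code token (hypothetical format)."""
    annotated = []
    for tok in tokenize.generate_tokens(io.StringIO(code).readline):
        if tok.type in _SKIPPED:
            continue
        annotated.append(f"{tok.string}<{tokenize.tok_name[tok.type]}>")
    return annotated


def select_first_valid(candidates: List[str]) -> Optional[str]:
    """Return the first candidate that parses; ast.parse stands in for a compiler."""
    for candidate in candidates:
        try:
            ast.parse(candidate)
        except SyntaxError:
            continue  # rejected by the syntax check, try the next candidate
        return candidate
    # Assumption: if nothing parses, fall back to the top-ranked candidate.
    return candidates[0] if candidates else None


if __name__ == "__main__":
    print(annotate_with_types("x = 1"))
    ranked = ["def f(:\n    pass", "def f():\n    return 1"]
    print(select_first_valid(ranked))
```

In this sketch the candidates are assumed to arrive ranked by model score, so the syntax check acts as a filter over the ranked list rather than as a new scoring function.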