Deep learning techniques have recently shown great success in automatic code generation. Inspired by code reuse, some researchers have proposed copy-based approaches that copy content from similar code snippets to obtain better performance. In practice, human developers identify the parts of a similar code snippet that are relevant to their needs, which can be viewed as a code sketch, and then edit that sketch into the desired code. However, existing copy-based approaches ignore code sketches and tend to repeat the similar code without the necessary modifications, which leads to incorrect results. In this paper, we propose a sketch-based code generation approach named SkCoder that mimics developers' code reuse behavior. Given a natural language requirement, SkCoder retrieves a similar code snippet, extracts the relevant parts as a code sketch, and edits the sketch into the desired code. Our motivation is that the extracted sketch provides a well-formed pattern that tells the model "how to write", while the post-editing step adds requirement-specific details to the sketch and outputs the complete code. We conduct experiments on two public datasets and a new dataset collected in this work, comparing our approach to 20 baselines using 5 widely used metrics. Experimental results show that (1) SkCoder generates more correct programs and outperforms the state-of-the-art model, CodeT5-base, by 30.30%, 35.39%, and 29.62% on the three datasets. (2) Our approach is effective for multiple code generation models and improves them by up to 120.1% in Pass@1. (3) We investigate three plausible code sketches and discuss the importance of sketches. (4) We manually evaluate the generated code and demonstrate the superiority of SkCoder in three aspects.
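The retrieve, sketch, and edit steps described above can be illustrated with a small Python sketch. This is only a toy under stated assumptions, not SkCoder's implementation: the retriever here is a simple token-overlap (Jaccard) ranking, the sketcher is a hand-written heuristic that masks tokens not mentioned in the requirement, and the editor is a caller-supplied function standing in for a neural model. All names (`retrieve`, `extract_sketch`, `generate`, `PLACEHOLDER`) are hypothetical.

```python
import keyword
import re
from typing import Callable, List, Tuple

PLACEHOLDER = "<pad>"  # hypothetical mask token, chosen for this illustration


def tokenize(text: str) -> List[str]:
    """Split text into identifiers, numbers, and single symbols."""
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", text)


def jaccard(a: List[str], b: List[str]) -> float:
    """Token-set overlap; a stand-in for a real retriever's similarity."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def retrieve(requirement: str, corpus: List[Tuple[str, str]]) -> str:
    """Return the code whose description best overlaps the requirement.

    `corpus` is a list of (description, code) pairs.
    """
    query = [t.lower() for t in tokenize(requirement)]
    best = max(
        corpus,
        key=lambda pair: jaccard(query, [t.lower() for t in tokenize(pair[0])]),
    )
    return best[1]


def extract_sketch(requirement: str, similar_code: str) -> List[str]:
    """Keep symbols, Python keywords, and words named in the requirement.

    Every other token is masked, and runs of masks are merged into one.
    This heuristic is an assumption made for the example.
    """
    req_words = {t.lower() for t in tokenize(requirement)}
    sketch: List[str] = []
    for tok in tokenize(similar_code):
        is_word = tok[0].isalpha() or tok[0] == "_" or tok.isdigit()
        keep = (not is_word) or keyword.iskeyword(tok) or tok.lower() in req_words
        out = tok if keep else PLACEHOLDER
        if out == PLACEHOLDER and sketch and sketch[-1] == PLACEHOLDER:
            continue  # merge consecutive placeholders
        sketch.append(out)
    return sketch


def generate(
    requirement: str,
    corpus: List[Tuple[str, str]],
    editor: Callable[[str, List[str]], str],
) -> str:
    """Run retrieve -> sketch -> edit; `editor` stands in for a trained model."""
    similar_code = retrieve(requirement, corpus)
    sketch = extract_sketch(requirement, similar_code)
    return editor(requirement, sketch)


if __name__ == "__main__":
    demo_corpus = [
        ("sum a list of numbers", "def total(xs):\n    return sum(xs)"),
        ("read a file", "def read(path):\n    return open(path).read()"),
    ]
    # A trivial editor that just prints the sketch it was given.
    print(generate("return the max of a list of numbers", demo_corpus,
                   lambda req, sk: " ".join(sk)))
```

The point of the intermediate sketch is that the editor receives the reusable structure (`def`, `return`, the call shape) without the requirement-specific names of the retrieved snippet, so it has to fill those in rather than copy them.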