Developers often perform repetitive code editing activities for various reasons (e.g., code refactoring) during software development. Many deep learning models have been applied to automate code editing by learning from the code editing history. Recently, pre-trained code editing models have achieved state-of-the-art (SOTA) results. These models are first pre-trained with pre-training tasks and then fine-tuned on the code editing task. Existing pre-training tasks are mainly code infilling tasks (e.g., masked language modeling), which are derived from the natural language processing field and are not designed for code editing. In this paper, we propose a pre-training task specialized in code editing and present an effective pre-trained code editing model named CodeEditor. Our pre-training task further improves the performance and generalization ability of code editing models. Specifically, we collect real-world code snippets as the ground truth and use a generator to rewrite them into natural but inferior versions. Then, we pre-train CodeEditor to edit the inferior versions back into the ground truth, so that it learns edit patterns. We conduct experiments on four datasets and evaluate models in three settings. (1) In the fine-tuning setting, we fine-tune the pre-trained CodeEditor on the four datasets. CodeEditor outperforms SOTA baselines by 15%, 25.5%, 9.4%, and 26.6% on the four datasets, respectively. (2) In the few-shot setting, we fine-tune the pre-trained CodeEditor with limited data. CodeEditor substantially outperforms all baselines, even those fine-tuned with all data. (3) In the zero-shot setting, we evaluate the pre-trained CodeEditor without fine-tuning. CodeEditor correctly edits 1,113 programs, whereas SOTA baselines cannot work in this setting. The results demonstrate the superiority of our pre-training task and show that the pre-trained CodeEditor is more effective for automatic code editing.
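As a rough illustration of the pre-training data construction described above, the sketch below builds (inferior version, ground truth) pairs from real code snippets. The generator here is a hypothetical stand-in (a simple identifier perturbation); the function names and masking strategy are assumptions for illustration only, not the paper's actual generator.

```python
# Minimal sketch (assumption, not the paper's implementation) of building
# (inferior, ground_truth) pairs for the code-editing pre-training task.
import random
import re


def naive_generator(code: str, seed: int = 0) -> str:
    """Hypothetical stand-in for the generator: perturb one identifier so the
    snippet stays plausible ("natural") but no longer matches the ground truth."""
    rng = random.Random(seed)
    identifiers = re.findall(r"[A-Za-z_][A-Za-z_0-9]*", code)
    if not identifiers:
        return code
    target = rng.choice(identifiers)
    return code.replace(target, target + "_tmp", 1)


def build_pretraining_pairs(snippets):
    """Pair each rewritten (inferior) version with its original ground truth."""
    pairs = []
    for snippet in snippets:
        inferior = naive_generator(snippet)
        if inferior != snippet:  # keep only snippets the generator actually changed
            pairs.append((inferior, snippet))
    return pairs


if __name__ == "__main__":
    ground_truth = "def add(a, b):\n    return a + b\n"
    for inferior, target in build_pretraining_pairs([ground_truth]):
        print("--- inferior version (model input) ---")
        print(inferior)
        print("--- ground truth (model target) ---")
        print(target)
```

During pre-training, the model would receive the inferior version as input and be trained to produce the ground truth, thereby learning edit patterns before being fine-tuned on real code editing data.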