Advances in natural language processing, such as transfer learning from pre-trained language models, have also shaped how models are trained for programming language tasks. Prior research has primarily explored pre-training on code and extended it through multi-modality and multi-tasking, yet the data available for downstream tasks remain modest in size. Focusing on data utilization for downstream tasks, we propose and adapt augmentation methods that yield consistent improvements in code translation and summarization of up to 6.9% and 7.5%, respectively. Further analysis suggests that our methods are complementary to one another and provide benefits in output code style and numeric consistency. We also discuss imperfections in the test data.
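To make the notion of augmenting downstream code data concrete, the sketch below shows one common style of augmentation for code corpora: consistently renaming identifiers in a snippet to create an additional training example. This is purely an illustrative, hypothetical example (the function `rename_identifiers` and its heuristics are assumptions), not the specific methods proposed in the paper.

```python
import re

def rename_identifiers(code: str) -> str:
    """Illustrative code augmentation: consistently rename identifiers.

    A minimal sketch of a typical augmentation for code data; it is not
    the paper's method, only an example of the general idea.
    """
    # Very rough heuristic: treat lowercase tokens as identifiers.
    tokens = sorted(set(re.findall(r"\b[a-z_][a-z0-9_]*\b", code)))
    keywords = {"def", "return", "for", "in", "if", "else", "while", "range"}
    candidates = [t for t in tokens if t not in keywords]
    # Map each identifier to a fresh placeholder name.
    mapping = {t: f"var_{i}" for i, t in enumerate(candidates)}
    # Apply the renaming consistently across the whole snippet.
    return re.sub(
        r"\b[a-z_][a-z0-9_]*\b",
        lambda m: mapping.get(m.group(0), m.group(0)),
        code,
    )

if __name__ == "__main__":
    snippet = "def add(a, b):\n    total = a + b\n    return total"
    print(rename_identifiers(snippet))
    # Produces a semantically equivalent variant with renamed identifiers,
    # which can be added to the downstream training set.
```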