Recently, deep neural networks (DNNs) have been widely applied to programming language understanding. Generally, training a DNN model with competitive performance requires massive, high-quality labeled training data. However, collecting and labeling such data is time-consuming and labor-intensive. To tackle this issue, data augmentation has become a popular solution: it enlarges the training set by generating new label-preserving samples, e.g., via adversarial example generation. However, few works have explored data augmentation for programming language-related tasks. In this paper, we propose MixCode, a Mixup-based data augmentation approach for the source code classification task. First, we utilize multiple code refactoring methods to generate label-consistent code data. Second, we employ the Mixup technique to mix the original code with the transformed code, forming new training data on which the model is trained. We evaluate MixCode on two programming languages (Java and Python), two code tasks (problem classification and bug detection), four datasets (JAVA250, Python800, CodRep1, and Refactory), and five model architectures. Experimental results demonstrate that MixCode outperforms the standard data augmentation baseline by up to 6.24\% in accuracy and 26.06\% in robustness.
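For readers unfamiliar with Mixup, its core operation is a convex combination of two training inputs and their labels. Below is a minimal sketch of how such mixing could look for a program and its refactored variant; the function name, the embedding-level mixing, and the alpha value are illustrative assumptions, not details taken from the abstract:

    import numpy as np

    def mixup_code_pair(x_orig, x_refactored, y, alpha=0.2):
        # Hypothetical sketch, not the paper's exact implementation.
        # x_orig / x_refactored: fixed-size vector representations of a
        # program and its refactored, label-consistent variant;
        # y: the shared label; alpha: the usual Mixup Beta-distribution
        # hyperparameter (value assumed here).
        lam = np.random.beta(alpha, alpha)
        x_mixed = lam * x_orig + (1.0 - lam) * x_refactored
        # Standard Mixup also interpolates the two labels; because code
        # refactoring preserves the label, the mixed label collapses
        # back to y.
        return x_mixed, y

Note that because the two mixed inputs share a label, the usual label interpolation of Mixup becomes trivial: the augmentation smooths the input space around each program without introducing any label noise.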