Inspired by the great success of deep neural networks (DNNs) in natural language processing (NLP), DNNs have been increasingly applied to source code analysis and have attracted significant attention from the software engineering community. Because it is data-driven, a DNN model requires a large amount of high-quality labeled training data to achieve expert-level performance. Collecting such data is often not hard, but the labeling process is notoriously laborious. DNN-based code analysis makes the situation worse because labeling source code also demands sophisticated expertise. Data augmentation has been a popular approach to supplementing training data in domains such as computer vision and NLP. However, existing data augmentation approaches in code analysis adopt simple methods, such as data transformation and adversarial example generation, and thus yield limited performance gains. In this paper, we propose MIXCODE, a data augmentation approach that aims to effectively supplement valid training data and is inspired by Mixup, a recent advance in computer vision. Specifically, we first use multiple code refactoring methods to generate transformed code whose labels are consistent with those of the original data. Then, we adapt the Mixup technique to mix the original code with the transformed code to augment the training data. We evaluate MIXCODE on two programming languages (Java and Python), two code tasks (problem classification and bug detection), four benchmark datasets (JAVA250, Python800, CodRep1, and Refactory), and seven model architectures (including two pre-trained models, CodeBERT and GraphCodeBERT). Experimental results demonstrate that MIXCODE outperforms the baseline data augmentation approach by up to 6.24% in accuracy and 26.06% in robustness.
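The mixing step can be illustrated with a minimal sketch of the general Mixup formulation, in which two inputs and their labels are linearly interpolated with a coefficient drawn from a Beta distribution. This is not the MIXCODE implementation: it assumes that each program has already been encoded as a fixed-length numeric vector, and the function names, the toy vectors, and the value of `alpha` are all hypothetical.

```python
import random
from typing import List, Optional, Sequence, Tuple

Vector = List[float]


def one_hot(label: int, num_classes: int) -> Vector:
    """Encode an integer class label as a one-hot vector."""
    return [1.0 if i == label else 0.0 for i in range(num_classes)]


def mix(a: Sequence[float], b: Sequence[float], lam: float) -> Vector:
    """Linear interpolation: lam * a + (1 - lam) * b, element by element."""
    if len(a) != len(b):
        raise ValueError("both vectors must have the same length")
    return [lam * x + (1.0 - lam) * y for x, y in zip(a, b)]


def mixup_pair(
    original_vec: Sequence[float],
    transformed_vec: Sequence[float],
    original_label: Sequence[float],
    transformed_label: Sequence[float],
    alpha: float = 0.2,  # assumed value, not taken from the paper
    rng: Optional[random.Random] = None,
) -> Tuple[Vector, Vector]:
    """Mix an original sample with its refactored counterpart.

    The same coefficient lam ~ Beta(alpha, alpha) is applied to the
    input vectors and to the label vectors.
    """
    rng = rng or random.Random()
    lam = rng.betavariate(alpha, alpha)
    mixed_input = mix(original_vec, transformed_vec, lam)
    mixed_label = mix(original_label, transformed_label, lam)
    return mixed_input, mixed_label


if __name__ == "__main__":
    # Toy vectors standing in for encoded programs (hypothetical values).
    original = [0.2, 0.8, 0.5]
    refactored = [0.3, 0.7, 0.4]
    # A label-preserving refactoring keeps the class unchanged.
    label = one_hot(1, num_classes=3)
    x_mixed, y_mixed = mixup_pair(original, refactored, label, label,
                                  rng=random.Random(0))
    print(x_mixed, y_mixed)
```

Because a label-preserving refactoring gives both samples the same label, the mixed label in this sketch stays at the original class up to rounding, while the mixed input lies between the two encodings.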