Boosting Source Code Learning with Data Augmentation: An Empirical Study

The next era of program understanding is being propelled by the use of machine learning to solve software problems. Recent studies have reported surprisingly strong results from source code learning, which applies deep neural networks (DNNs) to critical software tasks such as bug detection and clone detection. This success owes much to the availability of massive, high-quality training data. In practice, data augmentation, a technique for producing additional training data, has been widely adopted in other domains such as computer vision. In source code learning, however, data augmentation has not been studied extensively, and existing practice is limited to simple syntax-preserving methods such as code refactoring. When used as training data, source code is typically represented in two ways: sequentially, as text, and structurally, as graphs. Inspired by this analogy, we take an early step toward investigating whether data augmentation methods originally designed for text and graphs can improve the training quality of source code learning. To that end, we first collect and categorize data augmentation methods from the literature. We then conduct a comprehensive empirical study on four critical tasks and 11 DNN architectures to evaluate the effectiveness of 12 data augmentation methods (code refactoring plus 11 methods for text and graph data). Our results identify the data augmentation methods that produce more accurate and more robust models for source code learning: those based on mixup (e.g., SenMixup for text and Manifold-Mixup for graphs) and those that slightly break the syntax of source code (e.g., random swap and random deletion for text).
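To make the text-style methods named in the abstract concrete, the following is a minimal sketch of random swap, random deletion, and a generic mixup interpolation. It is not the study's implementation: the function names (`random_swap`, `random_deletion`, `mixup`), the parameters (`n_swaps`, `p`, `alpha`), and the choice to treat code as whitespace-separated tokens are all assumptions made for illustration. The mixup function shows only the basic idea of interpolating two examples and their labels; how SenMixup or Manifold-Mixup choose the layer at which to mix is not covered here.

```python
import random
from typing import List, Sequence, Tuple


def random_swap(tokens: Sequence[str], n_swaps: int, rng: random.Random) -> List[str]:
    """Swap two randomly chosen token positions, repeated n_swaps times.

    The result keeps exactly the same tokens but may no longer be valid code.
    """
    out = list(tokens)
    if len(out) < 2:
        return out
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(out)), 2)  # two distinct positions
        out[i], out[j] = out[j], out[i]
    return out


def random_deletion(tokens: Sequence[str], p: float, rng: random.Random) -> List[str]:
    """Drop each token independently with probability p.

    If every token would be dropped, keep one random token so the
    augmented sample is never empty.
    """
    if not tokens:
        return []
    kept = [t for t in tokens if rng.random() >= p]
    return kept if kept else [rng.choice(list(tokens))]


def mixup(
    x_a: Sequence[float],
    x_b: Sequence[float],
    y_a: Sequence[float],
    y_b: Sequence[float],
    alpha: float,
    rng: random.Random,
) -> Tuple[List[float], List[float]]:
    """Linearly interpolate two feature vectors and their one-hot labels.

    The mixing weight is drawn from Beta(alpha, alpha); the vectors here
    stand in for whatever embeddings a model would actually mix (assumption).
    """
    lam = rng.betavariate(alpha, alpha)
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y_a, y_b)]
    return x, y


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so runs are repeatable
    code_tokens = "int add ( int a , int b ) { return a + b ; }".split()
    print(random_swap(code_tokens, n_swaps=1, rng=rng))
    print(random_deletion(code_tokens, p=0.1, rng=rng))
    print(mixup([0.0, 0.0], [1.0, 1.0], [1.0, 0.0], [0.0, 1.0], alpha=0.4, rng=rng))
```

Random swap and random deletion operate on the raw token sequence and can therefore yield code that no longer parses, which is the sense in which the abstract describes them as slightly breaking syntax. Mixup, by contrast, leaves the source text untouched and blends numeric representations and labels instead.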
