Deep Neural Networks (DNNs) are increasingly being adopted for software engineering and code intelligence tasks. These powerful tools are capable of learning highly generalizable patterns from large datasets through millions of parameters. At the same time, training DNNs means walking a knife's edge, because their large capacity also renders them prone to memorizing individual data points. While memorization has traditionally been regarded as a symptom of over-training, recent work suggests that the risk manifests especially strongly when the training dataset is noisy, leaving memorization as the model's only recourse. Unfortunately, most code intelligence tasks rely on noise-prone and repetitive data sources, such as GitHub, which, due to their sheer size, cannot be manually inspected and curated. We evaluate the memorization and generalization tendencies of neural code intelligence models through a case study spanning several benchmarks and model families, leveraging established approaches from other fields that use DNNs, such as introducing targeted noise into the training dataset. Beyond reinforcing prior general findings about the extent of memorization in DNNs, our results shed light on the impact of noisy training data.
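The "targeted noise" methodology referenced above is commonly realized as random label corruption, in the spirit of Zhang et al.'s work on rethinking generalization: a model that fits deliberately corrupted labels can only do so by memorizing them, since those labels carry no generalizable signal. The Python sketch below illustrates one plausible way to implement this under the assumption of integer class labels; the function name, parameters, and encoding are illustrative assumptions, not the paper's actual implementation.

```python
import random

def inject_label_noise(examples, labels, noise_rate=0.2, num_classes=None, seed=0):
    """Randomly corrupt the labels of a fraction of training examples.

    A sketch of the label-noise technique used to probe memorization:
    examples whose labels are corrupted can only be fit by memorization,
    so tracking them separates memorized from generalized behavior.
    Assumes labels are integer class ids; names here are hypothetical.
    """
    rng = random.Random(seed)
    if num_classes is None:
        num_classes = len(set(labels))
    noisy_labels = list(labels)
    n_noisy = int(noise_rate * len(labels))
    noisy_indices = rng.sample(range(len(labels)), n_noisy)
    for i in noisy_indices:
        # Swap the true label for a different, randomly chosen class.
        candidates = [c for c in range(num_classes) if c != labels[i]]
        noisy_labels[i] = rng.choice(candidates)
    return examples, noisy_labels, set(noisy_indices)
```

Returning the set of corrupted indices is the key design choice: after training, one can measure accuracy separately on the corrupted and clean subsets, where fitting the former indicates memorization while performance on the latter (and on held-out data) reflects generalization.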