Deep Neural Networks are capable of learning highly generalizable patterns from large datasets of source code through their millions of parameters. This large capacity also renders them prone to memorizing data points. Recent work suggests that the risk of memorization manifests especially strongly when the training dataset is noisy, involving many ambiguous or questionable samples, so that memorization is the only recourse. Unfortunately, most code intelligence tasks rely on rather noise-prone and repetitive data sources, such as code from GitHub. Given the sheer size of such corpora, determining the role and extent of noise in them is beyond manual inspection. In this paper, we propose an alternative analysis: we evaluate the impact of noise on training neural models of source code by introducing targeted noise into the datasets of several state-of-the-art neural code intelligence models and benchmarks based on Java and Python codebases. By studying the resulting behavioral changes at various noise rates, and across a wide range of metrics, we can characterize both typical, generalizing learning and problematic, memorization-like learning in models of source code. Our results highlight important risks: millions of trainable parameters allow neural networks to memorize anything, including noisy data, and can provide a false sense of generalization. At the same time, the metrics used to analyze this phenomenon proved surprisingly useful for detecting and quantifying such effects, offering a powerful toolset for creating reliable models of code.
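To make the noise-injection idea concrete, the following minimal Python sketch corrupts a chosen fraction of labels in a toy dataset; the helper `inject_label_noise` and the example labels are purely illustrative assumptions and not part of the paper's actual tooling.

```python
import random

def inject_label_noise(labels, noise_rate, label_pool, seed=0):
    """Hypothetical helper: flip a fraction `noise_rate` of labels to a
    different label from `label_pool`, mimicking targeted noise injection.
    Returns the corrupted labels and the indices that were altered."""
    rng = random.Random(seed)
    n_noisy = int(round(noise_rate * len(labels)))
    noisy_idx = rng.sample(range(len(labels)), n_noisy)
    corrupted = list(labels)
    for i in noisy_idx:
        # replace the original label with any other label from the pool
        alternatives = [l for l in label_pool if l != labels[i]]
        corrupted[i] = rng.choice(alternatives)
    return corrupted, set(noisy_idx)

# Example: corrupt 20% of method-name labels in a toy dataset
labels = ["getName", "setName", "toString", "equals", "hashCode"]
pool = sorted(set(labels))
noisy_labels, flipped = inject_label_noise(labels, 0.2, pool)
```

Repeating such an injection at several noise rates, then comparing training and evaluation metrics across runs, is one way to separate generalizing behavior from memorization-like behavior in the sense described above.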