Deep learning methods, which have found successful applications in fields such as image classification and natural language processing, have recently been applied to source code analysis as well, thanks to the enormous amount of freely available source code (e.g., from open-source software repositories). In this work, we elaborate upon a state-of-the-art approach to representing source code that uses information about its syntactic structure, and we adapt it to represent source code changes (i.e., commits). We use this representation to classify security-relevant commits. Because our method uses transfer learning (that is, we train a network on a "pretext task" for which abundant labeled data is available, and then we use that network for the target task of commit classification, for which fewer labeled instances are available), we study the impact of pre-training the network on two different pretext tasks compared to a randomly initialized model. Our results indicate that representations that leverage the structural information obtained through code syntax outperform token-based representations. Furthermore, the performance metrics obtained when pre-training on a loosely related pretext task with a very large dataset ($>10^6$ samples) were surpassed when pre-training on a smaller dataset ($>10^4$ samples) with a pretext task more closely related to the target task.