Code similarity systems are integral to a range of applications, from code recommendation to automated software defect correction. We argue that code similarity is now a first-order problem that must be solved. To begin addressing it, we present Machine Inferred Code Similarity (MISIM), a novel end-to-end code similarity system that consists of two core components. First, MISIM uses a novel context-aware semantic structure, which is designed to aid in lifting semantic meaning from code syntax. Second, MISIM provides a neural-based code similarity scoring algorithm, which can be implemented with various neural network architectures with learned parameters. We compare MISIM to three state-of-the-art code similarity systems: (i) code2vec, (ii) Neural Code Comprehension, and (iii) Aroma. In our experimental evaluation across 328,155 programs (over 18 million lines of code), MISIM has 1.5x to 43.4x better accuracy than all three systems.
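To make the scoring step concrete, the following is a minimal sketch under two assumptions that the abstract does not state: that each program has already been encoded as a fixed-length vector, and that similarity between two such vectors is measured with cosine similarity. The vectors `vec_a`, `vec_b`, and `vec_c` are hypothetical stand-ins for the output of a trained network; this is not MISIM's implementation.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Convention chosen for this sketch: an all-zero vector matches nothing.
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical embeddings (assumption): a trained encoder would produce these.
vec_a = [1.0, 2.0, 0.0]
vec_b = [2.0, 4.0, 0.0]  # same direction as vec_a
vec_c = [0.0, 0.0, 3.0]  # orthogonal to vec_a

score_ab = cosine_similarity(vec_a, vec_b)
score_ac = cosine_similarity(vec_a, vec_c)
```

Under these assumptions, a pair of programs whose vectors point in the same direction scores close to 1, and a pair with orthogonal vectors scores 0. The choice of encoder architecture is independent of this final scoring function.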