Code semantics similarity can be used for many tasks, such as code recommendation, automated software defect correction, and clone detection. However, the accuracy of such systems has not yet reached a level of general-purpose reliability. To help address this, we present Machine Inferred Code Similarity (MISIM), a neural code semantics similarity system consisting of two core components: (i) a novel context-aware semantics structure, purpose-built to lift semantics from code syntax; and (ii) an extensible neural code similarity scoring algorithm, which can be used with various neural network architectures with learned parameters. We compare MISIM to four state-of-the-art systems, including two additional hand-customized models, over 328K programs consisting of over 18 million lines of code. Our experiments show that MISIM achieves 8.08% better accuracy (measured by MAP@R) than the next-best-performing system.
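For readers unfamiliar with the MAP@R metric named above, the following is a minimal sketch of how it is commonly computed over a set of labeled embeddings. It is not MISIM's evaluation code: the function name `map_at_r`, the use of squared Euclidean distance, and the toy data are assumptions made purely for illustration.

```python
def map_at_r(embeddings, labels):
    """Mean Average Precision at R, using every item as a query.

    For each query, R is the number of *other* items with the same label.
    The R nearest neighbors are retrieved, and precision is accumulated
    only at the ranks where a same-label item appears.
    Assumption: similarity is measured by squared Euclidean distance.
    """
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    scores = []
    for q, (q_vec, q_label) in enumerate(zip(embeddings, labels)):
        # R = number of other items sharing the query's label.
        r = sum(1 for i, lab in enumerate(labels) if i != q and lab == q_label)
        if r == 0:
            continue  # a query with no relevant items is skipped

        # Rank all other items by distance to the query (closest first).
        others = sorted(
            (i for i in range(len(labels)) if i != q),
            key=lambda i: sq_dist(q_vec, embeddings[i]),
        )

        hits = 0
        total = 0.0
        for rank, i in enumerate(others[:r], start=1):
            if labels[i] == q_label:
                hits += 1
                total += hits / rank  # precision at this rank
        scores.append(total / r)

    return sum(scores) / len(scores) if scores else 0.0


# Toy example: hypothetical 1-D "embeddings" of four programs.
toy_embeddings = [(0.0,), (1.0,), (3.0,), (10.0,)]
toy_labels = [0, 0, 1, 0]
toy_score = map_at_r(toy_embeddings, toy_labels)
```

In a code similarity setting, each embedding would be the vector a model produces for one program, and each label would identify the problem that program solves, so a higher score means same-problem programs are ranked closer together.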