Code clones are duplicate code fragments that share identical or similar syntax or semantics. Code clone detection plays an important role in software maintenance, code refactoring, and reuse. A substantial body of research has addressed clone detection. Most of these approaches rely on lexical and syntactic information, and only a few target semantic clones. Recently, motivated by the success of deep learning models in other fields, including natural language processing and computer vision, researchers have attempted to adopt deep learning techniques to detect code clones. These approaches use lexical information (tokens) and/or syntactic structures such as abstract syntax trees (ASTs). However, they do not make sufficient use of the available structural and semantic information, which limits their capabilities. This paper addresses the problem of semantic code clone detection using program dependency graphs and geometric neural networks, leveraging structured syntactic and semantic information. We have developed a prototype tool, HOLMES, based on our novel approach and empirically evaluated it on popular code clone benchmarks. Our results show that HOLMES performs considerably better than TBCCD, another state-of-the-art tool. We also evaluated HOLMES on unseen projects and performed cross-dataset experiments to assess its generalizability. Our results affirm that HOLMES outperforms TBCCD, since most of the pairs that HOLMES detected were either undetected or suboptimally reported by TBCCD.
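To make the idea of comparing programs through their dependency graphs concrete, the following is a minimal sketch and not the HOLMES architecture. It assumes hand-built toy program dependency graphs, one-hot statement categories as node features, fixed (untrained) weights per dependence kind, a single round of mean-aggregation message passing, and cosine similarity as the comparison. All names (`NODE_TYPES`, `EDGE_WEIGHTS`, `embed`, `cosine`, and the three example graphs) are hypothetical and chosen for illustration only.

```python
import math
from collections import defaultdict

# Coarse statement categories used as one-hot node features (an assumption
# for this sketch; a learned model would use much richer node embeddings).
NODE_TYPES = ["assign", "loop", "branch", "return"]

# Hypothetical fixed weights per dependence kind; a trained graph neural
# network would learn these parameters instead.
EDGE_WEIGHTS = {"data": 0.5, "control": 0.25}


def one_hot(label):
    return [1.0 if label == t else 0.0 for t in NODE_TYPES]


def embed(pdg, rounds=1):
    """Return a graph-level vector after `rounds` of message passing."""
    nodes, edges = pdg
    h = {n: one_hot(label) for n, label in nodes.items()}
    for _ in range(rounds):
        # Collect incoming neighbour states, grouped by edge kind.
        incoming = defaultdict(lambda: defaultdict(list))
        for src, dst, kind in edges:
            incoming[dst][kind].append(h[src])
        new_h = {}
        for n in nodes:
            vec = list(h[n])
            for kind, weight in EDGE_WEIGHTS.items():
                msgs = incoming[n][kind]
                if msgs:
                    # Mean of the neighbour states reached via this edge kind.
                    mean = [sum(col) / len(msgs) for col in zip(*msgs)]
                    vec = [v + weight * m for v, m in zip(vec, mean)]
            new_h[n] = vec
        h = new_h
    # Readout: sum node states into one vector for the whole graph.
    return [sum(col) for col in zip(*h.values())]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Toy PDG for: s = 0; for x in xs: s += x; return s
PDG_FOR = (
    {"n0": "assign", "n1": "loop", "n2": "assign", "n3": "return"},
    [("n0", "n2", "data"), ("n1", "n2", "control"),
     ("n2", "n3", "data"), ("n0", "n3", "data")],
)

# Toy PDG for the same summation written with a while loop and an index.
PDG_WHILE = (
    {"m0": "assign", "m1": "assign", "m2": "loop",
     "m3": "assign", "m4": "assign", "m5": "return"},
    [("m0", "m3", "data"), ("m1", "m2", "data"), ("m1", "m3", "data"),
     ("m2", "m3", "control"), ("m2", "m4", "control"), ("m1", "m4", "data"),
     ("m4", "m2", "data"), ("m3", "m5", "data"), ("m0", "m5", "data")],
)

# Toy PDG for an unrelated function: if a > b: return a; else: return b
PDG_MAX = (
    {"c0": "branch", "c1": "return", "c2": "return"},
    [("c0", "c1", "control"), ("c0", "c2", "control")],
)

if __name__ == "__main__":
    print("for vs while:", cosine(embed(PDG_FOR), embed(PDG_WHILE)))
    print("for vs max:  ", cosine(embed(PDG_FOR), embed(PDG_MAX)))
```

In this toy setting the two summation variants, which differ syntactically, end up with a much higher similarity than the summation and the unrelated function. That mirrors the intuition that data and control dependences capture semantics that token- or AST-level views can miss.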