Neural clone detection has attracted attention from software engineering researchers and practitioners. However, most neural clone detection methods do not generalize beyond the kinds of clones that appear in their training dataset, which degrades model performance, particularly recall. In this paper, we present ASTRO, an Abstract Syntax Tree (AST) assisted approach for generalizable neural clone detection and a framework for finding clones in codebases that reflect industry practices. ASTRO has three main components: (1) an AST-inspired representation for source code that leverages program structure and semantics, (2) a global graph representation that captures the context of an AST within a corpus of programs, and (3) a graph embedding for programs that, in combination with existing large-scale language models, improves on state-of-the-art code clone detection. Our experimental results show that ASTRO outperforms state-of-the-art neural clone detection approaches in both recall and F1 score.
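To make the intuition behind an AST-based representation concrete, the following minimal sketch compares code snippets by the node types in their syntax trees. It is an illustration only and not ASTRO's actual representation, graph construction, or embedding: it uses Python's standard `ast` module, and the helper names `ast_node_types` and `structural_similarity`, as well as the multiset-overlap score, are assumptions introduced here.

```python
import ast
from collections import Counter


def ast_node_types(source: str) -> list[str]:
    """Return the node-type names found in the AST of `source`."""
    tree = ast.parse(source)
    # ast.walk visits every node; identifiers are ignored, only node types are kept.
    return [type(node).__name__ for node in ast.walk(tree)]


def structural_similarity(a: str, b: str) -> float:
    """Overlap of node-type multisets (hypothetical score, not from the paper)."""
    counts_a = Counter(ast_node_types(a))
    counts_b = Counter(ast_node_types(b))
    shared = sum((counts_a & counts_b).values())  # multiset intersection
    total = sum((counts_a | counts_b).values())   # multiset union
    return shared / total if total else 1.0


# Two functions that differ only in identifier names, plus an unrelated one.
snippet_a = "def add(x, y):\n    return x + y\n"
snippet_b = "def total(first, second):\n    return first + second\n"
snippet_c = "def show(items):\n    for item in items:\n        print(item)\n"

if __name__ == "__main__":
    print(structural_similarity(snippet_a, snippet_b))
    print(structural_similarity(snippet_a, snippet_c))
```

Because renaming identifiers leaves the tree structure unchanged, `snippet_a` and `snippet_b` receive the maximum score, while the structurally different `snippet_c` scores lower. A learned model would replace this hand-written score with embeddings, but the sketch shows why structure can be a more stable signal than surface tokens.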