Software clones are often introduced when developers reuse code fragments to implement similar functionality in the same or different software systems. Many high-performing clone detection tools today are based on deep learning techniques and are mostly used to detect clones written in the same programming language, although tools for detecting cross-language clones are also emerging rapidly. The popularity of deep learning-based clone detection tools creates an opportunity to investigate how known strategies that boost the performance of deep learning models could be leveraged to improve clone detection tools. In this paper, we investigate one such strategy, data augmentation, which, unlike in single-language clone detection, has not yet been explored for cross-language clone detection. We show how existing knowledge on transcompilers (source-to-source translators) can be used for data augmentation to boost the performance of cross-language clone detection models, as well as to adapt single-language clone detection models to create cross-language clone detection pipelines. To demonstrate the performance boost for cross-language clone detection through data augmentation, we exploit Transcoder, a pre-trained source-to-source translator. To show how to extend single-language models for cross-language clone detection, we extend a popular single-language model, Graph Matching Network (GMN), in combination with transcompilers. We evaluated our models on popular benchmark datasets. Our experimental results showed improvements in F1 scores (sometimes up to 3%) for the cutting-edge cross-language clone detection models. Even when extending GMN for cross-language clone detection, the models built leveraging data augmentation outperformed the baseline, with scores of 0.90, 0.92, and 0.91 for precision, recall, and F1 score, respectively.
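As a rough illustration of the augmentation idea, the following minimal sketch shows how labelled clone pairs might be expanded with a source-to-source translator. It is not the pipeline evaluated in this paper: the `ClonePair` record, the `augment_pairs` helper, and the `toy_translate` stand-in are hypothetical names introduced only for this example, and the sketch assumes that translation preserves the semantics of a fragment, so the clone label carries over to the translated pair.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class ClonePair:
    """A labelled pair of code fragments (hypothetical record type)."""
    code_a: str
    lang_a: str
    code_b: str
    lang_b: str
    is_clone: bool


# Assumed translator signature: (source_code, source_lang, target_lang) -> code.
Translator = Callable[[str, str, str], str]


def augment_pairs(pairs: List[ClonePair],
                  translate: Translator,
                  target_lang: str) -> List[ClonePair]:
    """Return the original pairs plus one translated variant per eligible pair.

    The first fragment of each pair is translated into `target_lang` and
    re-paired with the untouched second fragment. The label is copied,
    which is only valid if the translator preserves behaviour.
    """
    augmented = list(pairs)
    for pair in pairs:
        if pair.lang_a == target_lang:
            continue  # nothing to translate for this pair
        translated = translate(pair.code_a, pair.lang_a, target_lang)
        augmented.append(
            ClonePair(translated, target_lang,
                      pair.code_b, pair.lang_b, pair.is_clone)
        )
    return augmented


def toy_translate(code: str, src: str, tgt: str) -> str:
    """Stand-in for a real transcompiler; it only tags the fragment."""
    return f"[{src}->{tgt}] {code}"


pairs = [
    ClonePair("int add(int a,int b){return a+b;}", "java",
              "int sum(int x,int y){return x+y;}", "java", True),
    ClonePair("def add(a, b): return a + b", "python",
              "int mul(int a,int b){return a*b;}", "java", False),
]
augmented = augment_pairs(pairs, toy_translate, "python")
```

In this toy run the same-language Java pair yields an additional Python/Java pair, which mirrors how a single-language dataset could be adapted for cross-language training, while the pair whose first fragment is already in the target language is left as it is.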