Code clones are pairs of code snippets that implement similar functionality. Clone detection is a fundamental branch of automatic source code comprehension, with many applications, including refactoring recommendation, plagiarism detection, and code summarization. A particularly interesting case of clone detection is the detection of semantic clones, i.e., code snippets that have the same functionality but differ significantly in implementation. A promising approach to detecting semantic clones is contrastive learning (CL), a machine learning paradigm popular in computer vision but not yet commonly adopted for code processing. Our work aims to evaluate the most popular CL algorithms combined with three source code representations on two tasks. The first task is code clone detection, which we evaluate on the POJ-104 dataset containing implementations of 104 algorithms. The second task is plagiarism detection. To evaluate the models on this task, we introduce CodeTransformator, a tool for transforming source code. We use it to create a dataset that mimics plagiarised code based on competitive programming solutions. We trained nine models for both tasks and compared them with six existing approaches, including traditional tools and modern pre-trained neural models. The results of our evaluation show that the performance of the proposed models varies across tasks; however, the graph-based models generally outperform the others. Among CL algorithms, SimCLR and SwAV lead to better results, while MoCo is the most robust approach. Our code and trained models are available at https://doi.org/10.5281/zenodo.6360627 and https://doi.org/10.5281/zenodo.5596345.