Software clone detection identifies similar code snippets. It has been an active research topic for the last two decades. In recent years, machine learning (ML) based detectors, especially deep learning-based ones, have demonstrated impressive capability in clone detection, and it may seem that this longstanding problem has been tamed by advances in ML techniques. In this work, we challenge the robustness of recent ML-based clone detectors through semantic-preserving code transformations. We first combine fifteen simple code transformation operators with commonly used heuristics (namely Random Search, Genetic Algorithm, and Markov Chain Monte Carlo) to perform equivalent program transformation. Furthermore, we propose a deep reinforcement learning-based sequence generation (DRLSG) strategy to guide the search toward clones that escape detection. We then evaluate the ML-based detectors on pairs of original and generated clones. We realize our method in a framework named CloneGen. In the evaluation, we use CloneGen to challenge two state-of-the-art ML-based detectors and four traditional detectors with code clones produced by semantic-preserving transformations. Surprisingly, our experiments show that, despite the notable successes of existing clone detectors, the ML models inside these detectors still fail to recognize numerous clones produced by the code transformations in CloneGen. In addition, adversarial training of ML-based clone detectors using clones generated by CloneGen can improve their robustness and accuracy. Meanwhile, compared with the commonly used heuristics, the DRLSG strategy is the most effective at generating code clones that decrease the detection accuracy of the ML-based detectors.
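To make the search setup concrete, the following is a minimal sketch of Random Search over semantic-preserving transformations. It is not CloneGen's implementation: the two operators (`rename_op`, `dead_code_op`), the token-overlap `similarity` function standing in for a clone detector, and the sample function `total` are all assumptions chosen for illustration, and the sketch targets Python source (requiring Python 3.9+ for `ast.unparse`).

```python
import ast
import builtins
import random
import re

# Hypothetical program under transformation (not from the paper).
SOURCE = """
def total(values):
    acc = 0
    for v in values:
        acc = acc + v
    return acc
"""

class RenameLocals(ast.NodeTransformer):
    """Illustrative operator: consistently rename arguments and variables."""
    def __init__(self, rng):
        self.rng = rng
        self.mapping = {}

    def _fresh(self, name):
        if name not in self.mapping:
            # The index keeps new names distinct within one pass.
            self.mapping[name] = f"v{len(self.mapping)}_{self.rng.randrange(10**6)}"
        return self.mapping[name]

    def visit_arg(self, node):
        node.arg = self._fresh(node.arg)
        return node

    def visit_Name(self, node):
        if not hasattr(builtins, node.id):  # leave builtins such as len() alone
            node.id = self._fresh(node.id)
        return node

def rename_op(code, rng):
    return ast.unparse(RenameLocals(rng).visit(ast.parse(code)))

def dead_code_op(code, rng):
    """Illustrative operator: prepend an unused assignment to each function."""
    tree = ast.parse(code)
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            stmt = ast.parse(f"unused_{rng.randrange(1000)} = 0").body[0]
            node.body.insert(0, stmt)
    return ast.unparse(tree)

OPERATORS = [rename_op, dead_code_op]

def similarity(a, b):
    """Toy stand-in for a detector: Jaccard overlap of token sets."""
    ta = set(re.findall(r"\w+|[^\w\s]", a))
    tb = set(re.findall(r"\w+|[^\w\s]", b))
    return len(ta & tb) / len(ta | tb)

def random_search(source, budget=20, seed=0):
    """Sample random operator sequences; keep the least similar variant."""
    rng = random.Random(seed)
    best, best_score = source, similarity(source, source)
    for _ in range(budget):
        candidate = source
        for _ in range(rng.randint(1, 3)):
            candidate = rng.choice(OPERATORS)(candidate, rng)
        score = similarity(source, candidate)
        if score < best_score:
            best, best_score = candidate, score
    return best, best_score

def run_total(code, arg):
    """Execute a variant and call its `total` function (equivalence check)."""
    namespace = {}
    exec(code, namespace)
    return namespace["total"](arg)
```

Calling `random_search(SOURCE)` returns a variant together with its similarity score, and `run_total` can be used to spot-check that the variant behaves like the original on sample inputs. A guided strategy such as a genetic algorithm or a learned policy would replace the uniform sampling inside `random_search` while keeping the same operator and scoring interfaces.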