Code cloning plays an important role in open-source software engineering. Clones within a project may indicate a need for refactoring, and clones between projects are even more interesting, since code migration takes place and violations are possible. But how is code being copied? How prevalent is the process, and at what level does it happen? In this general study, we attempt to shed some light on these questions by searching for clones in a large dataset of over 23 thousand Java projects at the level of both files and methods, and by studying the code fragments themselves and their clone pairs. We study the size and age of code fragments, the prevalence of their clones, and the relationships between exact and non-exact clones, as well as between method-level and file-level clones. We also discover and describe various anomalies in the code clones detected in the dataset. Our research shows that copying has occurred throughout the years of Java code's existence and that method-level copying is much more prevalent than file-level copying, with only 35.4% of methods having no clones at all. Additionally, some of the discovered anomalies can be useful for future large-scale cloning research, as they can be used to remove auto-generated code.
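To make the distinction between exact and non-exact clones concrete, the following is a minimal sketch of how two method bodies might be compared. It is an illustration only: the abstract does not name the clone detector used in the study, so the bag-of-tokens overlap measure, the 0.8 threshold, and the helper names `tokenize`, `similarity`, and `classify` are all assumptions made for this example.

```python
import re
from collections import Counter

# Hypothetical tokenizer: identifiers, integer literals, and single symbols.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def tokenize(code: str) -> list[str]:
    """Split a code fragment into tokens, ignoring whitespace."""
    return TOKEN_RE.findall(code)


def similarity(a: str, b: str) -> float:
    """Bag-of-tokens overlap: shared tokens divided by the larger bag size."""
    bag_a, bag_b = Counter(tokenize(a)), Counter(tokenize(b))
    shared = sum((bag_a & bag_b).values())
    largest = max(sum(bag_a.values()), sum(bag_b.values()))
    return shared / largest if largest else 1.0


def classify(a: str, b: str, threshold: float = 0.8) -> str:
    """Label a pair as an exact clone, a non-exact clone, or unrelated.

    The threshold is an assumed value, not one taken from the study.
    """
    if tokenize(a) == tokenize(b):
        return "exact"
    return "non-exact" if similarity(a, b) >= threshold else "not a clone"


original = "int add(int a, int b) { return a + b; }"
reformatted = "int add(int a,   int b) {\n    return a + b;\n}"
renamed = "int sum(int a, int b) { return a + b; }"
unrelated = "void log(String s) { System.out.println(s); }"

print(classify(original, reformatted))
print(classify(original, renamed))
print(classify(original, unrelated))
```

In this sketch, a pair that differs only in whitespace counts as an exact clone, while a pair that differs in a single identifier falls into the non-exact category. The same comparison could be applied to whole files instead of methods by passing file contents as the fragments.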