Code cloning plays an important role in open-source software engineering. Clones within a project may indicate a need for refactoring, and clones between projects are even more interesting, since code migrates between projects and violations are possible. But how is code actually copied? How prevalent is the process, and at what level does it happen? In this general study, we attempt to shed some light on these questions by searching for clones in a large dataset of over 23 thousand Java projects at the level of both files and methods, and by studying the code fragments themselves and their clone pairs. We study the size and age of code fragments, the prevalence of their clones, and the relationships between exact and non-exact clones as well as between method-level and file-level clones. We also describe various anomalies that we discover among the code clones. Our research shows that copying has occurred throughout the years of Java code's existence and that method-level copying is much more prevalent than file-level copying: only 35.4% of methods have no clones. Additionally, some of the discovered anomalies can be useful for future large-scale cloning research, since they can be used to remove auto-generated code.