Clones in Deep Learning Code: What, Where, and Why?

Deep learning applications are becoming increasingly popular, and developers of deep learning systems strive to write more efficient code. These systems evolve constantly, which imposes tighter development timelines and increases complexity, and this may lead to bad design decisions. A copy-paste approach is widely used among deep learning developers because they rely on common frameworks and repeat similar tasks. Developers often fail to propagate changes to all clone fragments during maintenance. To our knowledge, no study has examined code cloning practices in deep learning development. Given the negative impacts of clones on software quality reported in studies of traditional systems, it is important to understand the characteristics and potential impacts of code clones in deep learning systems. To this end, we use the NiCad tool to detect clones in 59 Python-, 14 C#-, and 6 Java-based deep learning systems and in an equal number of traditional software systems. We then analyze the frequency and distribution of code clones in deep learning and traditional systems, and we further analyze the distribution of clones using a location-based taxonomy. We also study the correlation between bugs and code clones to assess the impact of clones on the quality of the studied systems. Finally, we introduce a code clone taxonomy for deep learning programs and identify the deep learning development phases in which cloning carries the highest risk of faults. Our results show that code cloning is a frequent practice in deep learning systems and that deep learning developers often clone code from files in distant repositories in the system. In addition, we found that code cloning occurs most frequently during DL model construction, and that hyperparameter setting is the phase in which cloning is riskiest, since it often leads to faults.
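As a rough illustration of what a location-based analysis of clones could look like, the following minimal Python sketch classifies a clone pair by where its two fragments live and tallies the distribution. The three category names, the `clone_location` helper, and the file paths are all assumptions made for illustration; the abstract does not list the categories of its taxonomy, and this is not the authors' implementation or NiCad's output format.

```python
from collections import Counter
from pathlib import PurePosixPath


def clone_location(path_a: str, path_b: str) -> str:
    """Classify a clone pair by the location of its two fragments.

    The category names are hypothetical; the abstract does not
    enumerate the categories of its location-based taxonomy.
    """
    a, b = PurePosixPath(path_a), PurePosixPath(path_b)
    if a == b:
        return "same file"
    if a.parent == b.parent:
        return "same directory"
    return "different directories"


# Hypothetical clone pairs, such as might be extracted from a detector report.
clone_pairs = [
    ("models/cnn.py", "models/cnn.py"),
    ("models/cnn.py", "models/rnn.py"),
    ("models/cnn.py", "experiments/run1/train.py"),
    ("train.py", "experiments/run1/train.py"),
]

# Count how many clone pairs fall into each location category.
distribution = Counter(clone_location(a, b) for a, b in clone_pairs)
for category, count in distribution.most_common():
    print(f"{category}: {count}")
```

A tally like this makes it possible to compare how often clones span distant parts of a code base in one set of systems versus another, which is the kind of comparison the abstract describes at a high level.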
