Deep learning (DL) has been widely applied in many domains. The shift in programming paradigm from traditional systems to DL systems poses unique engineering challenges, and performance is one of them. Performance problems (PPs) in DL systems can have severe consequences, such as excessive resource consumption and financial loss. While bugs in DL systems have been extensively investigated, PPs in DL systems have hardly been explored. To bridge this gap, we present the first comprehensive study to i) characterize the symptoms, root causes, and introducing and exposing stages of PPs in DL systems developed in TensorFlow and Keras, using 224 PPs collected from 210 StackOverflow posts, and ii) assess the capability of existing performance analysis approaches in tackling PPs, using a constructed benchmark of 58 PPs in DL systems. Our findings shed light on the implications for developing high-performance DL systems and for detecting and localizing PPs in them. To demonstrate the usefulness of our findings, we develop a static checker, Deep-Perf, to detect three types of PPs. It has detected 488 new PPs in 130 GitHub projects, of which 105 have been confirmed and 27 have been fixed.