Deep learning (DL) has been increasingly applied to a variety of domains. The programming paradigm shift from traditional systems to DL systems poses unique challenges in engineering DL systems. Performance is one of these challenges, and performance bugs (PBs) in DL systems can cause severe consequences such as excessive resource consumption and financial loss. While bugs in DL systems have been extensively investigated, PBs in DL systems have hardly been explored. To bridge this gap, we present the first comprehensive study to characterize the symptoms, root causes, and introducing and exposing stages of PBs in DL systems developed in TensorFlow and Keras, with a total of 238 PBs collected from 225 Stack Overflow posts. Our findings shed light on the implications for developing high-performance DL systems, and for detecting and localizing PBs in DL systems. We also build the first benchmark of 56 PBs in DL systems, and assess the capability of existing approaches in tackling them. Moreover, we develop a static checker, DeepPerf, to detect three types of PBs, and identify 488 new PBs in 130 GitHub projects. Of these, 62 have been confirmed and 18 have been fixed by developers.