Deep learning (DL) has become pervasive across a wide spectrum of today's software systems and applications. The rich features of these DL-based software applications (i.e., DL software) usually rely on powerful DL models. To train powerful DL models on large datasets efficiently, it has become common practice for developers to parallelize and distribute the computation and memory across multiple devices during training, a practice known as distributed training. However, existing efforts in the software engineering (SE) research community mainly focus on issues in the general process of training DL models. In contrast, to the best of our knowledge, the issues that developers encounter in distributed training have never been well studied. Given the surging importance of distributed training in the current practice of developing DL software, this paper fills this knowledge gap and presents the first comprehensive study of developers' issues in distributed training. To this end, we extract and analyze 1,054 real-world developers' issues in distributed training from Stack Overflow and GitHub, two commonly used data sources for studying software issues. We construct a fine-grained taxonomy consisting of 30 categories of fault symptoms and summarize common fix patterns for different symptoms. Based on these results, we suggest actionable implications and research avenues that can potentially facilitate the future development of distributed training.