Distributed training using multiple devices (e.g., GPUs) has been widely adopted for learning DNN models over large datasets. However, the performance of large-scale distributed training is often far from linear speed-up in practice. Given the complexity of distributed systems, it is challenging to identify the root cause(s) of inefficiency and apply effective performance optimizations when unexpectedly low training speed occurs. To date, there exists no software tool that diagnoses performance issues and helps expedite distributed DNN training across different deep learning frameworks. This paper proposes dPRO, a toolkit that includes: (1) an efficient profiler that collects runtime traces of distributed DNN training across multiple frameworks, especially fine-grained communication traces, and constructs global data flow graphs including detailed communication operations for accurate replay; (2) an optimizer that effectively identifies performance bottlenecks and explores optimization strategies (from computation, communication, and memory aspects) for training acceleration. We implement dPRO on multiple deep learning frameworks (TensorFlow, MXNet) and representative communication schemes (AllReduce and Parameter Server). Extensive experiments show that dPRO predicts the performance of distributed training in various settings with <5% error in most cases and finds optimization strategies with up to 3.48x speed-up over the baselines.
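To make the "global data flow graph and replay" idea concrete, below is a minimal illustrative sketch (not dPRO's actual implementation): per-worker traces are assumed to provide operation durations and dependency edges spanning both computation and communication operations, and iteration time is estimated by propagating earliest finish times through the merged graph. All operation names, durations, and edges are hypothetical, and the sketch ignores device/queue contention that a real replayer must model.

    from collections import defaultdict

    # Hypothetical per-op durations (ms) from two workers plus a shared AllReduce op.
    durations = {
        "w0.fwd": 3.0, "w0.bwd": 5.0, "w1.fwd": 3.5, "w1.bwd": 5.5,
        "allreduce.grad": 2.0, "w0.update": 0.5, "w1.update": 0.5,
    }
    # Dependency edges of the merged (global) data flow graph.
    edges = [
        ("w0.fwd", "w0.bwd"), ("w1.fwd", "w1.bwd"),
        ("w0.bwd", "allreduce.grad"), ("w1.bwd", "allreduce.grad"),
        ("allreduce.grad", "w0.update"), ("allreduce.grad", "w1.update"),
    ]

    def replay(durations, edges):
        """Estimate iteration time by propagating earliest finish times
        in topological order over the global graph."""
        preds, succs = defaultdict(list), defaultdict(list)
        indeg = {op: 0 for op in durations}
        for u, v in edges:
            preds[v].append(u)
            succs[u].append(v)
            indeg[v] += 1
        ready = [op for op, d in indeg.items() if d == 0]
        finish = {}
        while ready:
            op = ready.pop()
            start = max((finish[p] for p in preds[op]), default=0.0)
            finish[op] = start + durations[op]
            for s in succs[op]:
                indeg[s] -= 1
                if indeg[s] == 0:
                    ready.append(s)
        return max(finish.values())

    print(f"Predicted iteration time: {replay(durations, edges):.1f} ms")

Under these assumed numbers the critical path is w1.fwd -> w1.bwd -> allreduce.grad -> update, giving 11.5 ms; a replay-based predictor of this kind also makes it easy to re-estimate iteration time after a what-if change (e.g., shorter communication time).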