Distributed training using multiple devices (e.g., GPU servers) has been widely adopted for learning DNN models over large datasets. However, the performance of large-scale distributed training is often far from linear speed-up in practice. Given the complexity of distributed systems, it is challenging to identify the root cause(s) of inefficiency and apply effective performance optimizations when unexpectedly low training speed occurs. To date, there exists no software tool that diagnoses performance issues and helps expedite distributed DNN training across different machine learning frameworks. This paper proposes dPRO, a toolkit that includes: (1) an efficient profiler that collects runtime traces of distributed DNN training across multiple frameworks, especially fine-grained communication traces, and constructs global data-flow graphs including detailed communication operations for accurate replay; (2) an optimizer that effectively identifies performance bottlenecks and explores optimization strategies (from the computation, communication, and memory aspects) for training acceleration. We implement dPRO on multiple deep learning frameworks (PyTorch, TensorFlow, MXNet) and representative communication schemes (AllReduce and the Parameter Server architecture). Extensive experiments show that dPRO predicts the performance of distributed training in various settings with <5% error in most cases and finds optimization strategies with up to 87.1% speed-up over the baselines.
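To illustrate the replay idea the abstract describes, the following is a minimal, hypothetical sketch (not dPRO's actual API): given per-operation durations from collected traces and a global data-flow graph of dependencies, a trace-based simulator can predict iteration time by computing each operation's finish time along the critical path.

```python
# Hypothetical sketch of trace-based replay: simulate op start/end times
# over a global data-flow graph to predict per-iteration time.
# Names ("fwd", "bwd", "allreduce", "update") are illustrative only.

def replay(ops, deps):
    """ops: {op_name: duration_ms}; deps: {op_name: [upstream op names]}.
    Returns the simulated finish time of each op (critical-path order)."""
    end = {}

    def finish(op):
        if op not in end:
            # An op starts once all of its upstream dependencies finish.
            start = max((finish(u) for u in deps.get(op, [])), default=0.0)
            end[op] = start + ops[op]
        return end[op]

    for op in ops:
        finish(op)
    return end

# Toy single-iteration trace: forward -> backward -> AllReduce -> update.
ops = {"fwd": 3.0, "bwd": 5.0, "allreduce": 4.0, "update": 1.0}
deps = {"bwd": ["fwd"], "allreduce": ["bwd"], "update": ["allreduce"]}

end_times = replay(ops, deps)
iter_time = max(end_times.values())  # predicted iteration time: 13.0 ms
```

In a real system the graph would contain thousands of computation and fine-grained communication ops merged across workers; the same longest-path simulation then exposes where overlap between computation and communication is lost.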