We aim to resolve this problem by introducing a comprehensive distributed deep learning (DDL) profiler that determines the various execution "stalls" DDL suffers from while running on a public cloud. We have implemented the profiler by extending prior work to additionally estimate two types of communication stalls: interconnect stalls and network stalls. Using the profiler, we train popular DNN models to characterize various AWS GPU instances, and we list their advantages and shortcomings so that users can make informed decisions. We observe that the more expensive GPU instances may not be the most performant for all DNN models, and that AWS may allocate hardware interconnect resources sub-optimally. Specifically, the intra-machine interconnect can introduce communication overheads of up to 90% of DNN training time, and network-connected instances can suffer slowdowns of up to 5x compared to training on a single instance. Further, we model the impact of macroscopic DNN features, such as the number of layers and the number of gradients, on communication stalls. Finally, we propose a measurement-based recommendation model that helps users lower their public cloud monetary costs for DDL, given a time budget.
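To make the stall decomposition concrete, the following is a minimal Python sketch, not the paper's actual profiler: it assumes a synchronous data-parallel loop in which gradient synchronization runs after the backward pass, and attributes the synchronization time to communication stalls (interconnect or network, depending on where the workers sit). The function names and the sleep-based stand-in phases are hypothetical.

    import time

    def profile_iteration(compute_fn, allreduce_fn):
        """Time one training iteration, splitting it into compute time and
        the communication stall spent synchronizing gradients."""
        t0 = time.perf_counter()
        compute_fn()    # forward + backward pass on the local GPU(s)
        t1 = time.perf_counter()
        allreduce_fn()  # gradient sync over NVLink/PCIe or the network
        t2 = time.perf_counter()
        return t1 - t0, t2 - t1

    # Stand-in phases: sleeps emulate a 40 ms compute phase and a 60 ms
    # synchronization phase, i.e., a 60% communication stall.
    compute_time, comm_stall = profile_iteration(
        lambda: time.sleep(0.04), lambda: time.sleep(0.06))
    print(f"compute {compute_time * 1e3:.1f} ms, "
          f"comm stall {comm_stall * 1e3:.1f} ms "
          f"({comm_stall / (compute_time + comm_stall):.0%} of iteration)")

Averaged over many iterations, this decomposition is what lets per-instance communication overhead be reported as a fraction of total training time, as in the 90% interconnect figure above.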
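Similarly, the measurement-based recommendation can be viewed as a feasibility search: among profiled configurations, pick the cheapest one whose measured throughput finishes the job within the time budget. The sketch below is an illustrative assumption of how such a model could work, not the paper's method; the throughput and price figures are hypothetical placeholders.

    def recommend(configs, total_samples, budget_hours):
        """Return (cost, name, hours) for the cheapest configuration that
        finishes within the budget, or None if none qualifies."""
        feasible = []
        for name, (samples_per_sec, price_per_hour) in configs.items():
            hours = total_samples / samples_per_sec / 3600.0
            if hours <= budget_hours:
                feasible.append((hours * price_per_hour, name, hours))
        return min(feasible) if feasible else None

    # Hypothetical profiled throughputs (samples/sec) and prices (USD/hr).
    configs = {
        "p3.2xlarge (1 GPU)":       (400.0, 3.06),
        "p3.8xlarge (4 GPUs)":      (1400.0, 12.24),
        "2 x p3.2xlarge (network)": (600.0, 6.12),
    }
    print(recommend(configs, total_samples=50_000_000, budget_hours=24))

With these placeholder numbers, the single multi-GPU instance wins: the network-connected pair is feasible but costs more, and the single GPU misses the budget, mirroring the observation that scaling out over the network is not always the cost-effective choice.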