Distributed deep neural network (DDNN) training constitutes an increasingly important workload that frequently runs in the cloud. Larger DNN models and faster compute engines are shifting DDNN training bottlenecks from computation to communication. This paper characterizes DDNN training to precisely pinpoint these bottlenecks. We found that timely training requires high-performance parameter servers (PSs) with optimized network stacks and gradient processing pipelines, as well as server and network hardware with balanced computation and communication resources. We therefore propose PHub, a high-performance, multi-tenant, rack-scale PS design. PHub co-designs the PS software and hardware to accelerate rack-level and hierarchical cross-rack parameter exchange, with an API compatible with many DDNN training frameworks. PHub provides a performance improvement of up to 2.7x compared to state-of-the-art distributed training techniques for cloud-based ImageNet workloads, with 25% better throughput per dollar.
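To make the parameter-server role concrete, the sketch below shows the generic synchronous aggregate-then-update pattern the abstract alludes to: workers push gradients, the PS averages them, applies an SGD step, and workers pull the new model. This is a minimal, single-process illustration of the general PS concept; the class and method names (ToyParameterServer, push_gradient, pull_params) are hypothetical and do not represent PHub's implementation or API.

```python
# Toy synchronous parameter-server loop (illustrative only, not PHub).
import numpy as np

class ToyParameterServer:
    def __init__(self, params, num_workers, lr=0.1):
        self.params = {k: v.copy() for k, v in params.items()}
        self.num_workers = num_workers
        self.lr = lr
        self._pending = {k: np.zeros_like(v) for k, v in params.items()}
        self._received = 0

    def push_gradient(self, worker_grads):
        """Accumulate one worker's gradients; apply an SGD step once all workers have reported."""
        for k, g in worker_grads.items():
            self._pending[k] += g
        self._received += 1
        if self._received == self.num_workers:
            for k in self.params:
                self.params[k] -= self.lr * (self._pending[k] / self.num_workers)
                self._pending[k][:] = 0
            self._received = 0

    def pull_params(self):
        """Workers pull the latest model after the synchronous update."""
        return {k: v.copy() for k, v in self.params.items()}

if __name__ == "__main__":
    ps = ToyParameterServer({"w": np.zeros(4)}, num_workers=2)
    ps.push_gradient({"w": np.array([1.0, 0.0, 1.0, 0.0])})
    ps.push_gradient({"w": np.array([0.0, 1.0, 0.0, 1.0])})
    print(ps.pull_params()["w"])  # averaged gradient applied once: [-0.05 -0.05 -0.05 -0.05]
```

In a real deployment this push/pull exchange happens over the network for every gradient of every layer, which is why the paper's focus is on the PS network stack, gradient processing pipeline, and the balance between server compute and link bandwidth.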