Most work in the deep learning systems community has focused on faster inference, but arriving at a trained model requires lengthy experiments. Accelerating training lets developers iterate faster and arrive at better models. DNN training is often seen as a compute-bound problem, best done in a single large compute node with many GPUs. As DNNs grow larger, however, training must be distributed. Distributed deep neural network (DDNN) training constitutes an important workload on the cloud. Larger DNN models and faster compute engines shift the training performance bottleneck from computation to communication. Our experiments show that existing DNN training frameworks do not scale in a typical cloud environment due to insufficient bandwidth and inefficient parameter server software stacks. We propose PHub, a high-performance parameter server (PS) software design that provides an optimized network stack and a streamlined gradient processing pipeline to benefit common PS setups, and PBox, a balanced, scalable central PS hardware platform that fully utilizes PHub's capabilities. We show that in a typical cloud environment, PHub can achieve up to 3.8x speedup over state-of-the-art designs when training ImageNet. We discuss future directions for integrating PHub with programmable switches for in-network aggregation during training, leveraging the datacenter network topology to reduce bandwidth usage and localize data movement.
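The PS design the abstract references follows the standard push/aggregate/pull pattern: workers push locally computed gradients, the server aggregates them and applies an optimizer step, and workers pull the updated parameters. The following is a minimal illustrative sketch of that pattern; the class and method names are hypothetical and do not reflect PHub's actual API.

```python
import numpy as np

class ParameterServer:
    """Minimal sketch of the parameter-server pattern (hypothetical API)."""

    def __init__(self, params, lr=0.1):
        self.params = params   # model parameters (a single tensor here)
        self.lr = lr           # SGD learning rate
        self.pending = []      # gradients pushed by workers this round

    def push(self, grad):
        """A worker pushes its locally computed gradient."""
        self.pending.append(grad)

    def aggregate_and_update(self):
        """Average all pushed gradients and apply one SGD step."""
        avg = np.mean(self.pending, axis=0)
        self.params = self.params - self.lr * avg
        self.pending.clear()
        return self.params     # workers pull the fresh parameters

# Two workers push gradients; the server averages them (avg = 2.0 per
# coordinate) and updates parameters: 0 - 0.1 * 2.0 = -0.2.
ps = ParameterServer(np.zeros(4))
ps.push(np.ones(4))
ps.push(3 * np.ones(4))
new_params = ps.aggregate_and_update()
```

In real DDNN training this aggregation happens per key (per layer's gradient tensor), and the paper's point is that the network stack and gradient pipeline around this loop, not the arithmetic itself, become the bottleneck at scale.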