TensorFlow has been the most widely adopted machine/deep learning framework. However, little exists in the literature that provides a thorough understanding of the capabilities that TensorFlow offers for the distributed training of large ML/DL models that require computation and communication at scale. The most commonly used distributed training approaches for TensorFlow can be categorized as follows: 1) Google Remote Procedure Call (gRPC), 2) gRPC+X, where X = (InfiniBand Verbs, Message Passing Interface, or GPUDirect RDMA), and 3) No-gRPC: Baidu Allreduce with MPI, Horovod with MPI, and Horovod with NVIDIA NCCL. In this paper, we provide an in-depth performance characterization and analysis of these distributed training approaches on various GPU clusters, including the Piz Daint system (ranked 6th on the Top500 list). We perform experiments to gain novel insights along the following vectors: 1) application-level scalability of DNN training, 2) effect of batch size on scaling efficiency, 3) impact of the MPI library used for no-gRPC approaches, and 4) type and size of DNN architectures. Based on these experiments, we present two key insights: 1) overall, no-gRPC designs achieve better performance than gRPC-based approaches for most configurations, and 2) the performance of no-gRPC designs is heavily influenced by the gradient aggregation performed using Allreduce. Finally, we propose a truly CUDA-Aware MPI Allreduce design that exploits CUDA kernels and pointer caching to perform large reductions efficiently. Our proposed designs offer 5-17X better performance than NCCL2 for small and medium messages, and reduce latency by 29% for large messages. The proposed optimizations help Horovod-MPI achieve approximately 90% scaling efficiency for ResNet-50 training on 64 GPUs. Further, Horovod-MPI achieves 1.8X and 3.2X higher throughput than the native gRPC method for ResNet-50 and MobileNet, respectively, on the Piz Daint cluster.
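To make the no-gRPC communication pattern concrete, the following is a minimal, hypothetical sketch of data-parallel gradient aggregation via MPI Allreduce, the collective that Baidu Allreduce and Horovod rely on and that the proposed CUDA-Aware design targets. It assumes mpi4py and NumPy are available and uses host buffers only; it illustrates the aggregation pattern, not the paper's CUDA-kernel-based Allreduce implementation itself.

```python
# Hypothetical sketch: gradient averaging with MPI_Allreduce (run with e.g. `mpirun -np 4 python allreduce_sketch.py`).
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

# Pretend each rank computed gradients for a ~25M-parameter model (ResNet-50 scale)
# on its local mini-batch; here they are simply filled with rank-dependent values.
local_grads = np.full(25_000_000, float(rank), dtype=np.float32)

# Sum the gradients across all ranks, then divide by the number of ranks so that
# every worker applies the same averaged gradient to its model replica.
global_grads = np.empty_like(local_grads)
comm.Allreduce(local_grads, global_grads, op=MPI.SUM)
global_grads /= size

if rank == 0:
    print(f"Averaged gradient[0] = {global_grads[0]:.3f} across {size} ranks")
```

In practice, frameworks such as Horovod perform this step on GPU buffers for every trainable tensor at each iteration, which is why the latency of the underlying Allreduce dominates no-gRPC scaling behavior.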