This paper reports our efforts on swCaffe, a highly efficient parallel framework for accelerating deep neural network (DNN) training on Sunway TaihuLight, the current fastest supercomputer in the world, which adopts a unique many-core heterogeneous architecture with 40,960 SW26010 processors connected through a customized communication network. First, we point out several insightful principles for fully exploiting the performance of this innovative many-core architecture. Second, we propose a set of optimization strategies for redesigning a variety of neural network layers based on Caffe. Third, we put forward a topology-aware parameter synchronization scheme to scale the synchronous Stochastic Gradient Descent (SGD) method to multiple processors efficiently. We evaluate our framework by training a variety of widely used neural networks on the ImageNet dataset. On a single node, swCaffe achieves 23\%--119\% of the overall performance of Caffe running on an NVIDIA K40m GPU, and runs 3.04x--7.84x faster than Caffe on CPU across all networks. Finally, we present the scalability of swCaffe for training ResNet-50 and AlexNet on up to 1024 nodes.