Multi-node Acceleration for Large-scale GCNs

Limited by memory capacity and compute power, single-node graph convolutional neural network (GCN) accelerators cannot complete the execution of GCNs within a reasonable amount of time, given the explosive growth in graph sizes. Large-scale GCNs therefore call for a multi-node acceleration system (MultiAccSys), much as TPU-Pod serves large-scale neural networks. In this work, we aim to scale up single-node GCN accelerators to accelerate GCNs on large-scale graphs. We first identify the communication pattern and challenges of multi-node acceleration for GCNs on large-scale graphs. We observe that (1) coarse-grained communication patterns exist in the execution of GCNs in a MultiAccSys, which introduce a massive amount of redundant network transmissions and off-chip memory accesses; and (2) overall, the acceleration of GCNs in a MultiAccSys is bandwidth-bound and latency-tolerant. Guided by these two observations, we then propose MultiGCN, the first MultiAccSys for large-scale GCNs, which trades network latency for network bandwidth. Specifically, by leveraging the network latency tolerance, we first propose a topology-aware multicast mechanism with a one put per multicast message-passing model to reduce transmissions and alleviate network bandwidth requirements. Second, we introduce a scatter-based round execution mechanism that cooperates with the multicast mechanism and reduces redundant off-chip memory accesses. Compared to the baseline MultiAccSys, MultiGCN achieves a 4~12x speedup while using only 28%~68% of the energy, and reduces network transmissions by 32% and off-chip memory accesses by 73% on average. It not only achieves a 2.5~8x speedup over the state-of-the-art multi-GPU solution, but also scales to large-scale graphs, which single-node GCN accelerators cannot.
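
To make the redundancy argument concrete, the following toy model counts cross-node feature transmissions for a partitioned graph under two schemes: one send per cross-node edge, and one send per (source vertex, destination node) pair. It is only an illustrative sketch and not MultiGCN's topology-aware multicast; the modulo partitioning, the function names `owner` and `count_transmissions`, and the example edge list are all assumptions introduced here.

```python
from collections import defaultdict


def owner(vertex, num_nodes):
    """Hypothetical partitioning: vertex v is stored on node v % num_nodes."""
    return vertex % num_nodes


def count_transmissions(edges, num_nodes):
    """Count cross-node feature sends for directed edges (src, dst).

    Returns (per_edge, per_destination):
      per_edge        - one send for every edge that crosses nodes
      per_destination - one send per (source vertex, remote node) pair,
                        i.e. duplicates to the same node are merged
    """
    per_edge = 0
    remote_targets = defaultdict(set)  # src vertex -> remote nodes that need it
    for src, dst in edges:
        src_node = owner(src, num_nodes)
        dst_node = owner(dst, num_nodes)
        if src_node != dst_node:
            per_edge += 1
            remote_targets[src].add(dst_node)
    per_destination = sum(len(nodes) for nodes in remote_targets.values())
    return per_edge, per_destination


# Made-up example graph on two accelerator nodes.
edges = [(0, 1), (0, 3), (0, 5), (0, 2), (1, 2), (1, 4), (2, 3)]
print(count_transmissions(edges, num_nodes=2))  # → (6, 3)
```

In this example, vertex 0 has three neighbors on the other node, so merging sends to the same destination turns three transmissions into one. The saving grows with the number of neighbors a vertex has on each remote node.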
