Full-batch training of Graph Neural Networks (GNNs) to learn the structure of large graphs is a critical problem that must scale to hundreds of compute nodes to be feasible. It is challenging due to the large memory capacity and bandwidth requirements on a single compute node and the high communication volumes across multiple nodes. In this paper, we present DistGNN, which optimizes the well-known Deep Graph Library (DGL) for full-batch training on CPU clusters via an efficient shared-memory implementation, communication reduction using a minimum vertex-cut graph partitioning algorithm, and communication avoidance using a family of delayed-update algorithms. Our results on four common GNN benchmark datasets (Reddit, OGB-Products, OGB-Papers, and Proteins) show up to 3.7x speed-up using a single CPU socket and up to 97x speed-up using 128 CPU sockets, respectively, over baseline DGL implementations running on a single CPU socket.
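The following is a minimal sketch, not the authors' implementation, of the communication-avoiding idea the abstract refers to: each graph partition aggregates its local neighbors every epoch, while partial aggregates contributed by remote partitions are exchanged only every few epochs and reused (stale) in between. All class, method, and parameter names here (DelayedAggregator, fetch_remote_partials, delay) are illustrative assumptions, not DGL or DistGNN APIs.

```python
import numpy as np

class DelayedAggregator:
    """Sketch of delayed-update neighbor aggregation on one partition."""

    def __init__(self, local_feats, delay):
        self.local_feats = local_feats      # features of vertices owned by this partition
        self.delay = delay                  # epochs between remote exchanges
        # cached (possibly stale) partial aggregates received from remote partitions
        self.remote_cache = np.zeros_like(local_feats)

    def aggregate(self, epoch, local_adj, fetch_remote_partials):
        # Local neighbor aggregation runs every epoch and needs no communication.
        local_agg = local_adj @ self.local_feats
        # Remote partial aggregates are refreshed only every `delay` epochs;
        # in between, the stale cached values are reused (communication avoidance).
        if epoch % self.delay == 0:
            self.remote_cache = fetch_remote_partials()
        return local_agg + self.remote_cache


# Toy usage: 4 local vertices, 8 features, remote refresh every 5 epochs.
rng = np.random.default_rng(0)
agg = DelayedAggregator(rng.standard_normal((4, 8)), delay=5)
adj = rng.integers(0, 2, size=(4, 4)).astype(float)
for epoch in range(10):
    h = agg.aggregate(epoch, adj, lambda: rng.standard_normal((4, 8)))
```

Trading staleness of remote contributions for fewer synchronization rounds is the essence of the "delayed-update" family mentioned above; the actual algorithms and their convergence behavior are detailed in the paper itself.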