Graph neural networks (GNNs) have proven to be a powerful algorithmic model for learning over graphs, with demonstrated effectiveness across a broad range of application fields. To scale GNN training to large-scale and ever-growing graphs, the most promising solution is distributed training, which spreads the training workload across multiple computing nodes. However, the workflows, computational patterns, communication patterns, and optimization techniques of distributed GNN training are still only preliminarily understood. In this paper, we provide a comprehensive survey of distributed GNN training by investigating the various optimization techniques it employs. First, distributed GNN training is classified into several categories according to its workflow; the computational and communication patterns of each category, together with the optimization techniques proposed in recent work, are then introduced. Second, the software frameworks and hardware platforms for distributed GNN training are presented for a deeper understanding. Third, distributed GNN training is compared with the distributed training of deep neural networks, emphasizing what is unique to distributed GNN training. Finally, interesting issues and opportunities in this field are discussed.