Graph neural networks (GNNs) are a class of deep learning models that learn over graphs, and they have been successfully applied in many domains. Despite their effectiveness, it remains challenging for GNNs to scale efficiently to large graphs. As a remedy, distributed computing is a promising solution for training large-scale GNNs, since it provides abundant computing resources. However, the dependencies induced by the graph structure make high-efficiency distributed GNN training difficult, as training suffers from massive communication overhead and workload imbalance. In recent years, much effort has been devoted to distributed GNN training, and an array of training algorithms and systems have been proposed. Yet there is still a lack of a systematic review of the optimization techniques, spanning graph processing to distributed execution. In this survey, we analyze three major challenges in distributed GNN training: massive feature communication, loss of model accuracy, and workload imbalance. We then introduce a new taxonomy of the optimization techniques that address these challenges, classifying existing techniques into four categories: GNN data partition, GNN batch generation, GNN execution model, and GNN communication protocol. We carefully discuss the techniques in each category. Finally, we summarize existing distributed GNN systems for multi-GPU, GPU-cluster, and CPU-cluster settings, respectively, and discuss future directions for scalable GNNs.