Training Graph Neural Networks (GNNs) on large graphs is resource-intensive and time-consuming, mainly because the large graph data cannot fit into the memory of a single machine and must instead be fetched from distributed graph storage and processed on the fly. Unlike distributed deep neural network (DNN) training, the bottleneck in distributed GNN training lies largely in the transmission of large graph data for constructing mini-batches of training samples. Existing solutions often advocate data-computation colocation, and do not work well with limited resources where such colocation is infeasible. The potential of strategic task placement and optimal scheduling of data transmission and task execution has not been well explored. This paper designs an efficient algorithm framework for task placement and execution scheduling in distributed GNN training, to better utilize resources, improve execution pipelining, and expedite training completion. Our framework consists of two modules: (i) an online scheduling algorithm that schedules the execution of training tasks and the data transmission plan; and (ii) an exploratory task placement scheme that decides the placement of each training task. We conduct thorough theoretical analysis, testbed experiments, and simulation studies, and observe up to 67% training speed-up with our algorithm as compared to representative baselines.
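To make the pipelining idea concrete, the following is a minimal, hypothetical Python sketch (not the paper's algorithm) of overlapping mini-batch graph-data transmission with training computation via a bounded prefetch queue; the names `fetch_subgraph`, `train_step`, and `pipelined_training` are illustrative stand-ins, not part of the framework described above.

```python
# Minimal sketch: overlap remote mini-batch fetching with GNN computation.
# All functions are hypothetical stand-ins; timings are simulated with sleep().
import queue
import threading
import time

def fetch_subgraph(batch_id):
    """Stand-in for pulling a mini-batch subgraph from remote graph storage."""
    time.sleep(0.05)  # simulated network transfer time
    return {"batch_id": batch_id, "features": [0.0] * 128}

def train_step(subgraph):
    """Stand-in for the GNN forward/backward pass on one mini-batch."""
    time.sleep(0.03)  # simulated computation time
    return sum(subgraph["features"])

def prefetcher(batch_ids, out_q):
    """Fetch mini-batches ahead of the trainer; the bounded queue limits memory."""
    for b in batch_ids:
        out_q.put(fetch_subgraph(b))  # blocks when the queue is full
    out_q.put(None)                   # sentinel: no more batches

def pipelined_training(num_batches=8, prefetch_depth=2):
    q = queue.Queue(maxsize=prefetch_depth)
    t = threading.Thread(target=prefetcher, args=(range(num_batches), q))
    t.start()
    while True:
        subgraph = q.get()
        if subgraph is None:
            break
        train_step(subgraph)          # computation overlaps with the next fetch
    t.join()

if __name__ == "__main__":
    start = time.time()
    pipelined_training()
    print(f"pipelined epoch took {time.time() - start:.2f}s")
```

Under these assumptions, the epoch time approaches the larger of total fetch time and total compute time, rather than their sum, which is the kind of overlap the scheduling framework aims to maximize.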