Large-scale graphs are ubiquitous in real-world scenarios and can be processed by Graph Neural Networks (GNNs) to generate representations for downstream tasks. Given the abundant information and complex topology of a large-scale graph, we argue that redundancy exists in such graphs and degrades training efficiency. Unfortunately, limited model scalability severely restricts the efficiency of training large-scale graphs with vanilla GNNs. Despite recent advances in sampling-based training methods, sampling-based GNNs generally overlook this redundancy issue, and it still takes an intolerable amount of time to train these models on large-scale graphs. We therefore propose to drop redundancy and improve the efficiency of training large-scale graphs with GNNs by rethinking the inherent characteristics of a graph. In this paper, we are the first to propose a once-for-all method, termed DropReef, to drop the redundancy in large-scale graphs. Specifically, we first conduct preliminary experiments to explore potential redundancy in large-scale graphs. Next, we present a metric to quantify the neighbor heterophily of every node in a graph. Based on both experimental and theoretical analysis, we identify the redundancy in a large-scale graph as nodes with high neighbor heterophily and a large number of neighbors. We then propose DropReef to detect and drop this redundancy once and for all, reducing training time without sacrificing model accuracy. To demonstrate the effectiveness of DropReef, we apply it to recent state-of-the-art sampling-based GNNs for training large-scale graphs, owing to the high accuracy of such models. With DropReef, the training efficiency of these models can be greatly improved. DropReef is highly compatible and performed offline, benefiting current and future state-of-the-art sampling-based GNNs to a significant extent.
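To make the core idea concrete, below is a minimal sketch of how the redundancy criterion described above could be implemented. It assumes neighbor heterophily is measured as the fraction of a node's neighbors whose labels differ from its own; the exact metric, as well as the `heterophily_thresh` and `degree_thresh` cutoffs, are illustrative assumptions and not the definitions or values used in the paper.

```python
import numpy as np
import scipy.sparse as sp


def neighbor_heterophily(adj: sp.csr_matrix, labels: np.ndarray) -> np.ndarray:
    """For each node, compute the fraction of neighbors whose label differs
    from the node's own label (0 for isolated nodes). Illustrative definition,
    not necessarily the metric used in the paper."""
    n = adj.shape[0]
    heterophily = np.zeros(n)
    for v in range(n):
        neighbors = adj.indices[adj.indptr[v]:adj.indptr[v + 1]]
        if neighbors.size > 0:
            heterophily[v] = np.mean(labels[neighbors] != labels[v])
    return heterophily


def drop_redundant_nodes(adj: sp.csr_matrix, labels: np.ndarray,
                         heterophily_thresh: float = 0.8,
                         degree_thresh: int = 50) -> np.ndarray:
    """Return a boolean mask that keeps nodes NOT flagged as redundant,
    i.e., nodes that are not simultaneously high-heterophily and high-degree.
    Thresholds are hypothetical placeholders."""
    degrees = np.asarray(adj.sum(axis=1)).ravel()
    heterophily = neighbor_heterophily(adj, labels)
    redundant = (heterophily > heterophily_thresh) & (degrees > degree_thresh)
    return ~redundant
```

Consistent with the once-for-all claim, such a mask would be computed a single time, offline, and the pruned graph could then be reused by any sampling-based GNN during training.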