The training of graph neural networks (GNNs) is extremely time-consuming because sparse graph-based operations are hard to accelerate with hardware. Prior art explores trading off computational precision to reduce the time complexity via sampling-based approximation. Based on this idea, previous works successfully accelerate dense matrix-based operations (e.g., convolution and linear layers) with negligible accuracy drop. However, unlike dense matrices, sparse matrices are stored in an irregular data format in which each row/column may have a different number of non-zero entries. Thus, compared to the dense counterpart, approximating sparse operations poses two unique challenges: (1) we cannot directly control the efficiency of an approximated sparse operation, since the computation is only executed on non-zero entries; (2) sub-sampling sparse matrices is much less efficient due to the irregular data format. To address these issues, our key idea is to control the accuracy-efficiency trade-off by optimizing the allocation of computation resources layer-wise and epoch-wise. Specifically, for the first challenge, we customize the computation resource assigned to different sparse operations while keeping the total resource usage below a certain budget. For the second challenge, we cache previously sampled sparse matrices to reduce the epoch-wise sampling overhead. Finally, we propose a switching mechanism to improve the generalization of GNNs trained with approximated operations. To this end, we propose Randomized Sparse Computation (RSC), which for the first time demonstrates the potential of training GNNs with approximated operations. In practice, RSC can achieve up to an $11.6\times$ speedup for a single sparse operation and a $1.6\times$ end-to-end wall-clock time speedup with negligible accuracy drop.
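To make the sampling-based idea concrete, the sketch below shows one common way to approximate a sparse-dense product A @ X by sub-sampling columns of A (and the matching rows of X) and rescaling for unbiasedness, plus a simple cache that reuses the sampled sub-matrix for several epochs to amortize the sampling overhead. This is only a minimal illustration under our own assumptions, not the paper's implementation; the names `approx_spmm`, `cached_sample`, `keep_ratio`, and `resample_every` are hypothetical.

```python
import numpy as np
import scipy.sparse as sp

def approx_spmm(A, X, keep_ratio=0.5, rng=None):
    """Approximate A @ X by sub-sampling columns of A and matching rows of X.

    A @ X is a sum of outer products A[:, k] * X[k, :]; keeping a uniform random
    subset of the k's and rescaling by n/k gives an unbiased estimate
    (standard randomized matrix multiplication; a sketch, not the paper's method).
    """
    rng = np.random.default_rng() if rng is None else rng
    n = A.shape[1]
    k = max(1, int(keep_ratio * n))
    idx = rng.choice(n, size=k, replace=False)       # sampled column indices
    A_sub = A.tocsc()[:, idx].tocsr()                # keep only the sampled columns
    return (A_sub @ X[idx, :]) * (n / k)             # rescale for unbiasedness

_cache = {}

def cached_sample(A, keep_ratio, epoch, resample_every=5, rng=None):
    """Reuse a sampled sub-matrix for several epochs (sketch of the caching idea)."""
    if epoch % resample_every == 0 or "A_sub" not in _cache:
        rng = np.random.default_rng() if rng is None else rng
        n = A.shape[1]
        k = max(1, int(keep_ratio * n))
        idx = rng.choice(n, size=k, replace=False)
        _cache["idx"], _cache["A_sub"] = idx, A.tocsc()[:, idx].tocsr()
    return _cache["idx"], _cache["A_sub"]

# Minimal usage example on a random sparse "adjacency" and dense features.
A = sp.random(1000, 1000, density=0.01, format="csr", random_state=0)
X = np.random.randn(1000, 64)
Y_exact = A @ X
Y_approx = approx_spmm(A, X, keep_ratio=0.3, rng=np.random.default_rng(0))
print(np.linalg.norm(Y_exact - Y_approx) / np.linalg.norm(Y_exact))
```

In this toy setting, `keep_ratio` plays the role of the per-operation computation budget, and `resample_every` controls how often new samples are drawn across epochs; the paper's layer-wise and epoch-wise resource allocation would choose these knobs per sparse operation under a global budget.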