Graph neural networks (GNNs), an emerging class of deep learning models, can extract meaningful representations from highly expressive graph-structured data and are therefore gaining popularity across a wider range of applications. However, current GNNs suffer from the poor performance of their sparse-dense matrix multiplication (SpMM) operator, even when using powerful GPUs. Our analysis shows that up to 95% of inference time can be spent on SpMM when running popular GNN models on NVIDIA's advanced V100 GPU. This SpMM performance bottleneck hinders GNNs' applicability to large-scale problems and the development of more sophisticated GNN models. To address this inference-time bottleneck, we introduce ES-SpMM, a cache-first edge-sampling mechanism and co-designed SpMM kernel. ES-SpMM uses edge sampling to downsize the graph to fit into the GPU's shared memory, thereby reducing computation cost and improving SpMM's cache locality. To evaluate ES-SpMM's performance, we integrated it with a popular GNN framework, DGL, and tested it using representative GNN models and datasets. Our results show that ES-SpMM outperforms the highly optimized cuSPARSE SpMM kernel by up to 4.35x with no accuracy loss and by 45.3x with less than a 1% accuracy loss.
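To make the edge-sampling idea concrete, the following is a minimal sketch of the concept, not the actual ES-SpMM kernel: each row of a CSR adjacency matrix is capped at `k` nonzeros, producing a bounded-degree graph small enough to fit in fast memory, after which SpMM proceeds on the downsized matrix. The function name `edge_sample_csr` and the sampling strategy (uniform random per row) are illustrative assumptions.

```python
import numpy as np
from scipy.sparse import csr_matrix

def edge_sample_csr(A, k, seed=None):
    """Keep at most k edges (nonzeros) per row of a CSR adjacency matrix.

    Illustrative sketch of cache-first edge sampling: a bounded-degree
    graph reduces SpMM work and fits in fast (e.g. shared) memory.
    Uniform per-row sampling is an assumption, not the paper's scheme.
    """
    rng = np.random.default_rng(seed)
    indptr, indices, data = A.indptr, A.indices, A.data
    new_indptr = [0]
    new_indices, new_data = [], []
    for r in range(A.shape[0]):
        s, e = indptr[r], indptr[r + 1]
        pos = np.arange(s, e)
        if len(pos) > k:
            pos = rng.choice(pos, size=k, replace=False)
        new_indices.extend(indices[pos])
        new_data.extend(data[pos])
        new_indptr.append(len(new_indices))
    return csr_matrix((new_data, new_indices, new_indptr), shape=A.shape)

# SpMM on the sampled graph: sparse adjacency times dense feature matrix.
A = csr_matrix(np.array([[1., 1., 1., 1.],
                         [0., 1., 0., 0.],
                         [1., 0., 1., 0.],
                         [0., 0., 0., 1.]]))
H = np.ones((4, 8))            # dense node-feature matrix
A2 = edge_sample_csr(A, k=2, seed=0)
out = A2 @ H                   # SpMM on the downsized graph
```

The shape of the output is unchanged; only the number of aggregated edges per node is bounded, which is what trades a small amount of accuracy for the large SpMM speedups reported above.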