In recent years, quantized graph neural networks (QGNNs) have attracted substantial research and industry attention due to their high robustness and low computation and memory overhead. Unfortunately, the performance gains of QGNNs have never been realized on modern GPU platforms. To this end, we propose the first Tensor Core (TC) based computing framework, QGTC, to support any-bitwidth computation for QGNNs on GPUs. We introduce a novel quantized low-bit arithmetic design based on low-bit data representation and bit-decomposed computation. We craft a TC-tailored CUDA kernel design that incorporates 3D-stacked bit compression, zero-tile jumping, and a non-zero tile reuse technique to systematically improve performance. We also incorporate an effective bandwidth-optimized subgraph packing strategy to maximize the transfer efficiency between the CPU host and the GPU device. We integrate QGTC with PyTorch for better programmability and extensibility. Extensive experiments demonstrate that QGTC achieves an average 2.7x speedup over the state-of-the-art Deep Graph Library framework across diverse settings.
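To make the "bit-decomposed computation" mentioned above concrete, the sketch below is a minimal NumPy illustration (not QGTC's CUDA/Tensor Core implementation) of how an a-bit by b-bit integer matrix product can be rewritten as a weighted sum of 1-bit plane products, which is the form that binary Tensor Core instructions consume. The function names and the NumPy setting are illustrative assumptions, not part of the paper's API.

```python
import numpy as np

def bit_planes(x, nbits):
    # Decompose an unsigned-integer matrix into its binary bit planes:
    # x == sum_i 2^i * planes[i], with planes[i] in {0, 1}.
    return [((x >> i) & 1).astype(np.uint32) for i in range(nbits)]

def bit_decomposed_matmul(a, b, a_bits, b_bits):
    # Compute a @ b as sum_{i,j} 2^(i+j) * (A_i @ B_j), where each A_i @ B_j
    # is a binary GEMM -- the kind of operation 1-bit TC instructions execute.
    a_planes = bit_planes(a, a_bits)
    b_planes = bit_planes(b, b_bits)
    acc = np.zeros((a.shape[0], b.shape[1]), dtype=np.uint64)
    for i, ap in enumerate(a_planes):
        for j, bp in enumerate(b_planes):
            acc += (np.uint64(1) << np.uint64(i + j)) * (ap @ bp).astype(np.uint64)
    return acc

# Usage: a 3-bit by 2-bit product matches the full-precision result.
rng = np.random.default_rng(0)
A = rng.integers(0, 8, size=(4, 6), dtype=np.uint32)   # 3-bit values
B = rng.integers(0, 4, size=(6, 5), dtype=np.uint32)   # 2-bit values
assert np.array_equal(bit_decomposed_matmul(A, B, 3, 2), (A @ B).astype(np.uint64))
```

Because every bit plane is a 0/1 matrix, any bitwidth reduces to repeated 1-bit GEMMs plus shifts and adds, which is what allows arbitrary-precision QGNN layers to map onto fixed-precision Tensor Core hardware.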