In recent years, quantized graph neural networks (QGNNs) have attracted considerable research and industry attention due to their high robustness and low computation and memory overhead. Unfortunately, the performance gains of QGNNs have never been realized on modern GPU platforms. To this end, we propose QGTC, the first Tensor Core (TC) based computing framework to support any-bitwidth computation for QGNNs on GPUs. We introduce a novel quantized low-bit arithmetic design based on low-bit data representation and bit-decomposed computation. We craft a TC-tailored CUDA kernel design that incorporates 3D-stacked bit compression, zero-tile jumping, and non-zero tile reuse to improve performance systematically. We further employ a bandwidth-optimized subgraph packing strategy to maximize data transfer efficiency between the CPU host and the GPU device. We integrate QGTC with PyTorch for better programmability and extensibility. Extensive experiments demonstrate that QGTC achieves up to 1.63x speedup over the state-of-the-art Deep Graph Library framework across diverse settings.
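To make the bit-decomposed computation idea concrete, the following is a minimal NumPy sketch (not the actual QGTC kernel) of how an any-bitwidth integer matrix product can be expressed as a weighted sum of 1-bit matrix products. The function names are illustrative, values are assumed to be unsigned quantized integers, and the plain integer matmul stands in for the 1-bit Tensor Core (BMMA) operation used on real hardware.

```python
import numpy as np

def bit_decompose(x, bits):
    """Split a non-negative integer matrix into `bits` binary planes
    (least-significant bit first), so that x == sum_i 2**i * planes[i]."""
    return [((x >> i) & 1) for i in range(bits)]

def bit_decomposed_matmul(a, b, a_bits, b_bits):
    """Compute a @ b for quantized integer matrices by accumulating
    1-bit x 1-bit matrix products, each scaled by 2**(i+j).

    Illustrative only: on the GPU, each plane-by-plane product would map
    to a 1-bit Tensor Core operation rather than a NumPy matmul."""
    a_planes = bit_decompose(a, a_bits)
    b_planes = bit_decompose(b, b_bits)
    out = np.zeros((a.shape[0], b.shape[1]), dtype=np.int64)
    for i, ap in enumerate(a_planes):
        for j, bp in enumerate(b_planes):
            out += (1 << (i + j)) * (ap.astype(np.int64) @ bp.astype(np.int64))
    return out

# Sanity check against an ordinary integer matmul.
rng = np.random.default_rng(0)
A = rng.integers(0, 2**3, size=(8, 16))   # 3-bit values
B = rng.integers(0, 2**2, size=(16, 4))   # 2-bit values
assert np.array_equal(bit_decomposed_matmul(A, B, 3, 2), A @ B)
```

The point of the decomposition is that arbitrary bitwidths reduce to 1-bit primitives, which is what allows Tensor Cores to serve bitwidths they do not natively support.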
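Similarly, a rough sketch of the zero-tile jumping idea, under the assumption that the (subgraph) adjacency matrix is processed in fixed-size tiles: tiles that are entirely zero contribute nothing to the product and can be skipped outright. Tile size, function name, and the dense representation here are illustrative simplifications of the TC-tailored kernel described above.

```python
import numpy as np

TILE = 8  # illustrative tile edge; the real kernel uses TC-friendly tile shapes

def tiled_matmul_skip_zero(a, b, tile=TILE):
    """Tiled matmul that jumps over tiles of `a` that are all zero,
    mimicking zero-tile jumping on a sparse adjacency matrix.
    Assumes all dimensions are multiples of `tile` for brevity."""
    m, k = a.shape
    k2, n = b.shape
    assert k == k2 and m % tile == 0 and k % tile == 0
    out = np.zeros((m, n), dtype=np.int64)
    for bi in range(0, m, tile):
        for bk in range(0, k, tile):
            a_tile = a[bi:bi + tile, bk:bk + tile]
            if not a_tile.any():
                continue  # zero-tile jumping: no work issued for this tile
            # a non-zero tile is multiplied against the matching row block of b
            out[bi:bi + tile, :] += a_tile @ b[bk:bk + tile, :]
    return out
```

In the same spirit, non-zero tiles that are needed by several output blocks can be kept resident and reused rather than re-loaded, which is the intent behind the non-zero tile reuse technique mentioned in the abstract.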