Mini-batch inference of Graph Neural Networks (GNNs) is a key problem in many real-world applications. Recently, a GNN design principle of model depth-receptive field decoupling has been proposed to address the well-known issue of neighborhood explosion. Decoupled GNN models achieve higher accuracy than the original models and demonstrate excellent scalability for mini-batch inference. We map decoupled GNN models onto CPU-FPGA heterogeneous platforms to achieve low-latency mini-batch inference. On the FPGA platform, we design a novel GNN hardware accelerator with an adaptive datapath, denoted Adaptive Computation Kernel (ACK), that can execute the various computation kernels of GNNs with low latency: (1) for dense computation kernels expressed as matrix multiplication, ACK works as a systolic array with fully localized connections; (2) for sparse computation kernels, ACK follows the scatter-gather paradigm and works as multiple parallel pipelines to support the irregular connectivity of graphs. The proposed task scheduling hides the CPU-FPGA data communication overhead to reduce the inference latency. We develop a fast design space exploration algorithm to generate a single accelerator for multiple target GNN models. We implement our accelerator on a state-of-the-art CPU-FPGA platform and evaluate its performance using three representative models (GCN, GraphSAGE, and GAT). Results show that our CPU-FPGA implementation achieves $21.4-50.8\times$, $2.9-21.6\times$, and $4.7\times$ latency reduction compared with state-of-the-art implementations on CPU-only, CPU-GPU, and CPU-FPGA platforms, respectively.
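To make the two kernel classes concrete, below is a minimal NumPy sketch of a GCN-style layer split into its sparse and dense parts: a scatter-gather aggregation over edges (the irregular kernel ACK maps onto parallel pipelines) followed by a dense matrix-multiplication feature transform (the kernel ACK maps onto a systolic array). The function names, the toy graph, and the sum aggregator are illustrative assumptions, not the paper's hardware design.

```python
import numpy as np

def scatter_gather_aggregate(edges, features):
    """Sparse kernel in the scatter-gather paradigm: each edge
    'scatters' the source node's feature as a message, and the
    destination node 'gathers' (here, sums) incoming messages.
    Access pattern is irregular and graph-dependent."""
    num_nodes, dim = features.shape
    out = np.zeros((num_nodes, dim))
    for src, dst in edges:           # irregular edge-driven traversal
        out[dst] += features[src]    # gather: accumulate the message
    return out

def dense_transform(features, weight):
    """Dense kernel expressed as matrix multiplication, the case
    that maps naturally onto a systolic array in hardware."""
    return features @ weight

# Toy 4-node ring graph with feature dimension 3 (hypothetical data)
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
X = np.random.rand(4, 3)             # node features
W = np.random.rand(3, 3)             # layer weight matrix
H = dense_transform(scatter_gather_aggregate(edges, X), W)
```

An adaptive datapath such as ACK must serve both patterns: the dense transform has regular, fully localized data movement, while the aggregation's memory accesses depend on the graph's connectivity, which is why the two modes are handled differently.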