This work proposes a novel reconfigurable architecture for low-latency Graph Neural Network (GNN) design, targeted specifically at particle detectors. Accelerating GNNs for particle detectors is challenging because deploying the networks for online event selection in the Level-1 triggers of the CERN Large Hadron Collider experiments requires sub-microsecond latency. This paper proposes a custom code transformation with strength reduction for the matrix multiplication operations in interaction-network based GNNs with fully connected graphs, which avoids costly multiplications. The transformation exploits sparsity patterns and binary adjacency matrices and avoids irregular memory access, reducing latency and improving hardware efficiency. In addition, we introduce an outer-product based matrix multiplication approach, enhanced by the strength reduction, for low-latency design. A fusion step is also introduced to further reduce the design latency. Furthermore, a GNN-specific algorithm-hardware co-design approach is presented, which not only finds a design with much lower latency but also finds a high-accuracy design under a given latency constraint. Finally, a customizable template for this low-latency GNN hardware architecture has been designed and open-sourced, which enables the generation of low-latency FPGA designs with efficient resource utilization using a high-level synthesis tool. Evaluation results show that our FPGA implementation is up to 24 times faster and consumes up to 45 times less power than a GPU implementation. Compared to our previous FPGA implementations, this work achieves 6.51 to 16.7 times lower latency. Moreover, the latency of our FPGA design is sufficiently low to enable deployment of GNNs in a sub-microsecond, real-time collider trigger system, allowing the trigger to benefit from the improved accuracy of GNNs.
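To make the strength-reduction idea concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes the usual interaction-network formulation, in which a binary receiver matrix `R_r` of shape (nodes × edges) has exactly one 1 per column, and it assumes edges of the fully connected graph are enumerated grouped by receiver. All function names (`build_edges`, `gather_columns`, `aggregate_by_receiver`, and so on) are hypothetical and chosen for illustration. Under these assumptions, multiplying node features by `R_r` reduces to copying columns, and multiplying edge features by the transpose of `R_r` reduces to additions only.

```python
def build_edges(n_nodes):
    """Enumerate the directed edges of a fully connected graph without
    self-loops, grouped by receiver (an assumed, fixed ordering)."""
    receivers, senders = [], []
    for r in range(n_nodes):
        for s in range(n_nodes):
            if s != r:
                receivers.append(r)
                senders.append(s)
    return receivers, senders


def one_hot_matrix(index, n_nodes):
    """Dense binary matrix R with R[i][e] = 1 exactly when index[e] == i."""
    return [[1 if index[e] == i else 0 for e in range(len(index))]
            for i in range(n_nodes)]


def transpose(m):
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    """Reference dense matrix product, using multiplications."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]


def gather_columns(features, index):
    """Strength-reduced features x R: column e of the result is simply
    column index[e] of features, so no multiplications are needed."""
    return [[row[i] for i in index] for row in features]


def aggregate_by_receiver(edge_feats, receivers, n_nodes):
    """Strength-reduced edge_feats x R^T: each edge value is added to the
    column of its receiver node, using additions only."""
    out = [[0] * n_nodes for _ in edge_feats]
    for d, row in enumerate(edge_feats):
        for e, value in enumerate(row):
            out[d][receivers[e]] += value
    return out


# Small example: 4 nodes, 2 features per node (illustrative values).
n_nodes = 4
features = [[1, 2, 3, 4],
            [5, 6, 7, 8]]
receivers, senders = build_edges(n_nodes)
R_r = one_hot_matrix(receivers, n_nodes)

dense_edges = matmul(features, R_r)
reduced_edges = gather_columns(features, receivers)
assert dense_edges == reduced_edges

dense_agg = matmul(dense_edges, transpose(R_r))
reduced_agg = aggregate_by_receiver(reduced_edges, receivers, n_nodes)
assert dense_agg == reduced_agg
```

Because the graph is fully connected, the index lists are known at design time, so in a hardware setting the gathers could in principle become fixed wiring rather than run-time memory lookups. That reading is an interpretation of the abstract's claim about avoiding irregular memory access, not a description of the released template.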